Alex Behm has posted comments on this change. Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner. ......................................................................
Patch Set 5: (7 comments) Tim, I needed to make a fix to TransferScratchTuples(). The new/faster version had a division by zero problem that is now fixed. http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 367: bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size, > Consider exposing this templated function directly and getting rid of the t In AssembleRows() we call Read*ValueBatch() on a ColumnReader* and not a concrete instance, so I think we need a virtual function, and hence the wrappers. If we wanted to expose this templated function directly, we'd need a virtual templated function (which doesn't exist). The other alternative I have considered is to separate the BaseScalarColumnReaders from the CollectionColumnReaders in AssembleRows() and then do column-wise materialization for the first, and for the produced rows, do row-size materialziation for the latter. I opted for the current solution to have a uniformity in how all top-level slots are populated, but I could go either way. Thoughts? Line 756: uint8_t* tuple_mem, int* num_values) { > You never call this function right? I would delete this, it's confusing and This function is called, see the virtual wrappers in this class. That's where the perf improvement comes from in this patch because we avoid the virtual calls to NextLevels() and ReadValue(). Line 1177: parent_->state_->LogError(msg); > Use LOG_OR_RETURN_ON_ERROR? Here and below Done Line 1680: scan_node_->mem_tracker()); > I would consider making scatch_batch a member variable of HdfsParquetScanne I'm not a fan of keeping everything as members because it makes the code hard to reason (side effects vs. function input/outputs). Since we're already deep down the member path here, I'm happy to conform. Done. Line 1727: // This function most not be called when the output batch is already full. > Maybe further note that this should never happen as long as we always Commi Done Line 1842: // The output batch must have enough room at this point. > This confused me at first. I would clarify that this is necessary if the ro Good point. A while loop below seems much clearer, thanks for the suggestion. Done. http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 464: /// tuple to be returned in a row batch. > "to be returned in a row batch" is confusing, if you're trying to define wh Removed. -- To view, visit http://gerrit.cloudera.org:8080/2779 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
