Skye Wanderman-Milne has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................
Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 367: bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
Consider exposing this templated function directly and getting rid of the two wrappers, since you call the wrapper functions based on AssembleRows' IN_COLLECTION param anyway.

Line 756: uint8_t* tuple_mem, int* num_values) {
You never call this function, right? I would delete it; it's confusing and I don't see where you would ever use it (this looks more like a job for codegen :)).

Line 1177: parent_->state_->LogError(msg);
Use LOG_OR_RETURN_ON_ERROR? Here and below.

Line 1680: scan_node_->mem_tracker());
I would consider making scratch_batch a member variable of HdfsParquetScanner, similar to batch_, since there's always a single scratch batch (I don't think you even need to create new ones per row group; you should be able to Reset()). Then you can get rid of a lot of this plumbing: it becomes obvious that there's only one scratch batch, and it also becomes obvious where pool allocations are coming from (rather than plumbing 'pool' through everywhere). OTOH this is good setup for nested types.

Line 1727: // This function most not be called when the output batch is already full.
Maybe further note that this should never happen as long as we always CommitRows() after calling TransferScratchTuples().

Line 1842: // The output batch must have enough room at this point.
This confused me at first. I would clarify that this is necessary if the row batch didn't have enough space to transfer everything below. Or, you could move this below the first TransferScratchTuples()/CommitRows() call, right? Is there any way we can enter this function with scratch tuples still to transfer? Or even put the Transfer/Commit in a while loop below, instead of depending on them having the same initial capacity.
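To illustrate the while-loop idea: a minimal standalone sketch, with stand-in structs and hypothetical names (DrainScratch, ScratchBatch, OutputBatch), not the real RowBatch/scratch-batch classes. The point is that draining in a loop removes the dependence on the two batches having the same initial capacity.

```cpp
#include <algorithm>
#include <vector>

// Stand-in simplifications of the real classes (hypothetical, for
// illustration only).
struct ScratchBatch {
  int num_tuples = 0;   // tuples materialized but not yet transferred
  int next_tuple = 0;   // index of the next tuple to transfer
  int RemainingTuples() const { return num_tuples - next_tuple; }
};

struct OutputBatch {
  int capacity = 0;
  int num_rows = 0;
  int RemainingCapacity() const { return capacity - num_rows; }
};

// Drain the scratch batch into as many output batches as needed.
// Returns the number of output batches committed.
int DrainScratch(ScratchBatch* scratch, std::vector<OutputBatch>* out,
                 int out_capacity) {
  int batches_committed = 0;
  while (scratch->RemainingTuples() > 0) {
    out->push_back(OutputBatch{out_capacity, 0});
    OutputBatch* batch = &out->back();
    int n = std::min(scratch->RemainingTuples(), batch->RemainingCapacity());
    // The real TransferScratchTuples() would copy/attach n tuples here.
    scratch->next_tuple += n;
    // The real CommitRows() equivalent.
    batch->num_rows += n;
    ++batches_committed;
  }
  return batches_committed;
}
```

E.g. 10 scratch tuples with an output capacity of 4 drain cleanly into three batches (4, 4, 2 rows), with no precondition that the output batch "must have enough room".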
http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 464: /// tuple to be returned in a row batch.
"to be returned in a row batch" is confusing. If you're trying to define what a top-level tuple means, I'd do something like "the top-level tuple, i.e. tuples are materialized into RowBatches and not CollectionValues". Or just remove it.

--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
