Skye Wanderman-Milne has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................
Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 367: bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
Consider exposing this templated function directly and getting rid of the two wrappers, since you call the wrapper functions based on AssembleRows' IN_COLLECTION param anyway.

Line 756: uint8_t* tuple_mem, int* num_values) {
You never call this function, right? I would delete it; it's confusing and I don't see where you would ever use it (this looks more like a job for codegen :)).

Line 1177: parent_->state_->LogError(msg);
Use LOG_OR_RETURN_ON_ERROR? Here and below.

Line 1680: scan_node_->mem_tracker());
I would consider making scratch_batch a member variable of HdfsParquetScanner, similar to batch_, since there's always a single scratch batch (I don't think you even need to create new ones per row group; you should be able to Reset()). Then you can get rid of a lot of this plumbing: it becomes obvious that there's only one scratch batch, and it also becomes obvious where pool allocations are coming from (rather than plumbing 'pool' through everywhere). OTOH this is good setup for nested types.

Line 1727: // This function most not be called when the output batch is already full.
Maybe further note that this should never happen as long as we always CommitRows() after calling TransferScratchTuples().

Line 1842: // The output batch must have enough room at this point.
This confused me at first. I would clarify that this is necessary if the row batch didn't have enough space to transfer everything below. Or, you could move this below the first TransferScratchTuples()/CommitRows() call, right? Is there any way we can enter this function with scratch tuples still to transfer? Or even put the Transfer/Commit in a while loop below, instead of depending on them having the same initial capacity.
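To illustrate the while-loop idea: a minimal standalone sketch, with stand-in structs and hypothetical names (DrainScratch, ScratchBatch, OutputBatch), not the real RowBatch/scratch-batch classes. The point is that draining in a loop removes the dependence on the two batches having the same initial capacity.

```cpp
#include <algorithm>
#include <vector>

// Stand-in simplifications of the real classes (hypothetical, for
// illustration only).
struct ScratchBatch {
  int num_tuples = 0;   // tuples materialized but not yet transferred
  int next_tuple = 0;   // index of the next tuple to transfer
  int RemainingTuples() const { return num_tuples - next_tuple; }
};

struct OutputBatch {
  int capacity = 0;
  int num_rows = 0;
  int RemainingCapacity() const { return capacity - num_rows; }
};

// Drain the scratch batch into as many output batches as needed.
// Returns the number of output batches committed.
int DrainScratch(ScratchBatch* scratch, std::vector<OutputBatch>* out,
                 int out_capacity) {
  int batches_committed = 0;
  while (scratch->RemainingTuples() > 0) {
    out->push_back(OutputBatch{out_capacity, 0});
    OutputBatch* batch = &out->back();
    int n = std::min(scratch->RemainingTuples(), batch->RemainingCapacity());
    // The real TransferScratchTuples() would copy/attach n tuples here.
    scratch->next_tuple += n;
    // The real CommitRows() equivalent.
    batch->num_rows += n;
    ++batches_committed;
  }
  return batches_committed;
}
```

E.g. 10 scratch tuples with an output capacity of 4 drain cleanly into three batches (4, 4, 2 rows), with no precondition that the output batch "must have enough room".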
http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 464: /// tuple to be returned in a row batch.
"to be returned in a row batch" is confusing. If you're trying to define what a top-level tuple means, I'd do something like "the top-level tuple, i.e. tuples are materialized into RowBatches and not CollectionValues". Or just remove it.

--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
