Alex Behm has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet 
scanner.
......................................................................


Patch Set 5:

(7 comments)

Tim, I needed to make a fix to TransferScratchTuples(). The new/faster version 
had a division by zero problem that is now fixed.

http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 367:   bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
> Consider exposing this templated function directly and getting rid of the t
In AssembleRows() we call Read*ValueBatch() on a ColumnReader* and not a 
concrete instance, so I think we need a virtual function, and
hence the wrappers. If we wanted to expose this templated function directly, 
we'd need a virtual templated function (which doesn't exist).

The other alternative I have considered is to separate the 
BaseScalarColumnReaders from the CollectionColumnReaders in AssembleRows() and 
then do column-wise materialization for the first, and for the produced rows, 
do row-size materialziation for the latter.

I opted for the current solution to have a uniformity in how all top-level 
slots are populated, but I could go either way.

Thoughts?


Line 756:       uint8_t* tuple_mem, int* num_values) {
> You never call this function right? I would delete this, it's confusing and
This function is called, see the virtual wrappers in this class. That's where 
the perf improvement comes from in this patch because we avoid the virtual 
calls to NextLevels() and ReadValue().


Line 1177:       parent_->state_->LogError(msg);
> Use LOG_OR_RETURN_ON_ERROR? Here and below
Done


Line 1680:         scan_node_->mem_tracker());
> I would consider making scatch_batch a member variable of HdfsParquetScanne
I'm not a fan of keeping everything as members because it makes the code hard 
to reason (side effects vs. function input/outputs). Since we're already deep 
down the member path here, I'm happy to conform. Done.


Line 1727:   // This function most not be called when the output batch is 
already full.
> Maybe further note that this should never happen as long as we always Commi
Done


Line 1842:     // The output batch must have enough room at this point.
> This confused me at first. I would clarify that this is necessary if the ro
Good point. A while loop below seems much clearer, thanks for the suggestion. 
Done.


http://gerrit.cloudera.org:8080/#/c/2779/5/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 464:   /// tuple to be returned in a row batch.
> "to be returned in a row batch" is confusing, if you're trying to define wh
Removed.


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to