Tim Armstrong has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner.
......................................................................
Patch Set 2:

(10 comments)

Some initial comments. I'm not sure that I fully understand some aspects of the changes, so I'll have to do another pass.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 273: /// once. Returns the number of values actually materialized in *num_values.
Can you document how errors are handled in these functions? E.g., what happens if an error is hit partway through the batch? The documentation was missing before, but I think it's even more important now.

Line 706: RowGroupAtEnd
Is this somewhat redundant? We should know the number of rows, so we can compute max_values as min(max_values, rows left in the group). And if there's an error, continue_execution should take care of that. I figure you'll probably get to this when doing the perf optimisations, but it would be good to remove any redundancy in the logic now so it's easier to understand.

Line 718: val_count
Maybe convert the loop into for (int val_count = 0; val_count < max_values; ++val_count) and just break out on exceptional conditions. I think it will make it easier to streamline the loop down the track. The LIKELY annotation should be redundant (if it isn't already) once the loop is in a reasonably canonical form.

Line 953: HdfsParquetScanner
Does this need to be nested inside the scanner? It seems unnecessary if it's local to the .cc file.

Line 979: state->batch_size() * tuple_byte_size
The cast is in the wrong place to prevent overflow. I think you want a calculation similar to the one I added in a recent patch to compute the batch size and allocate the memory. It probably makes sense to change the RowBatch interface so the logic to size the batch is the same across all the scanners: IMPALA-3105: avoid overrunning allocated tuple buffer.

Line 1752: if (UNLIKELY(
It seems easy to hoist these checks out of the hot loop, and doing so should make the code more readable, so maybe consider that in this patch set.
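To clarify the Line 718 suggestion above, here is a rough sketch of the canonical loop shape I mean; the names (ReadValue, MaterializeBatch) are invented stand-ins, not the scanner's actual code:

```cpp
// Sketch only: a plain counted for-loop over val_count, breaking out
// on exceptional conditions, so the compiler sees the trip count
// directly and branch-prediction hints become unnecessary.
static bool ReadValue(int i) { return i < 7; }  // stand-in for the column reader

int MaterializeBatch(int max_values) {
  int val_count = 0;  // declared outside the loop so the count survives the break
  for (; val_count < max_values; ++val_count) {
    if (!ReadValue(val_count)) break;  // error / end-of-data: bail out
  }
  return val_count;  // the scanner would report this via *num_values
}
```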
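To spell out the Line 979 overflow issue: the widening cast has to sit on an operand, not on the product. A minimal illustration (the function name is hypothetical, not Impala's actual code):

```cpp
#include <cstdint>

// Cast placement matters: casting the product,
//   static_cast<int64_t>(batch_size * tuple_byte_size)  // too late
// widens only after the int-by-int multiply has already overflowed.
// Widening an operand first makes the multiply happen in 64 bits:
int64_t TupleBufferSize(int batch_size, int tuple_byte_size) {
  return static_cast<int64_t>(batch_size) * tuple_byte_size;
}
```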
Line 1912: ReadCollectionItem(column_readers, materialize_tuple, pool, tuple);
Would it make sense to switch this over to the column-wise version now? I know you're probably going to do it eventually, but the intermediate state of the code might be easier to understand if it isn't a mix of column-wise and row-wise materialisation.

Line 1917: if (ExecNode::EvalConjuncts(&conjunct_ctxs[0], conjunct_ctxs.size(), row)) {
Can we have conjuncts on nested collection items?

Line 2564: return Status::OK();
?

Line 2893: return Status::OK();
?

--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
