Alex Behm has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet 
scanner.
......................................................................


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/2779/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 329:       uint8_t* tuple_mem, int* num_values) {
> describe role of tuple_mem
Done


Line 369:   bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
> instead of having this be templated, why not inline the code at the call si
Based on my experience working with this code, the most difficult part is 
getting all the strange edge cases to work, so duplicating loops/code seems 
like a bad idea to me.

I did what you suggested, but I think it will become more clear in my next 
patch (that is already in CR) that this template is useful for avoiding 
duplication of rather tricky code pieces. We can revisit in the next CR.

I moved the implementations below in this .cc


Line 708:   virtual bool ReadValue(MemPool* pool, Tuple* tuple) {
> do we still need this?
R: Yes. We need it for the row-wise materialization of items in 
CollectionValues.


Line 723:     return ReadValueBatch<false>(pool, max_values, tuple_size, 
tuple_mem, num_values);
> same comment about inlining the call of the templated function applies here
I don't think it's a good idea, but that will only become clear in my next 
patch, so I followed your suggestion, and we can revisit in the next CR.


Line 755:   template<bool IN_COLLECTION>
> since we're now dealing with batches, it would be nice to get rid of the te
IN_COLLECTION is gone for now. Did you have others in mind?


Line 1722:   // This function most not be called when the output batch is 
already full. As long as
> must
Done


Line 1797:     ++stats->total_possible;
> todo: do these stats updates outside the loop if possible, as in stats->tot
Merged with existing TODO. Done.


Line 1801:     // producing an output batch.
> good idea
Done


Line 1874:         continue_execution = col_reader->ReadNonRepeatedValueBatch(
> is there a more graceful name? ReadScalarValueBatch? TopLevelValueBatch?
Neither "scalar" nor "top-level" are discriminative enough. We can materialize 
repeated values into the top-level tuple, e.g.,:

select orderkey from customer.orders;

Happy to change the fn name, but I think the current name is technically 
accurate, and I could not come up with anything better.


http://gerrit.cloudera.org:8080/#/c/2779/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 484:   template <bool IN_COLLECTION>
> the comment says top-level tuples, how can in_collection be true? (or: what
IN_COLLECTION is to be understood in the "is in repeated field" sense and not 
in the "is materializing a CollectionValue" sense.
Agree it's confusing. Let's agree on a new name before I change it everywhere.
How about "IN_REPEATED_FIELD"?

I also removed IN_COLLECTION from AssembleRows(). Hopefully this makes the 
meaning a little clearer.

Note that we can materialize values in repeated fields into the top-level 
tuple, e.g.:

select orderkey from customers.orders;


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to