Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
......................................................................


Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG@24
PS2, Line 24: query into it
nit: query from it?


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@185
PS2, Line 185:   return num_cached_levels > 0;
instead?
return num_cached_levels != NULL

num_cached_levels > 0 doesn't indicate for me that num_cached_levels is a 
pointer.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@665
PS2, Line 665:     switch(page_encoding_) {
Would it have any impact if you did a reset on both bool_values_ and 
rle_decoder_ here unconditionally?
If this has to stay, could you enhance the function's comment to reflect?

As I see you branch on page_encoding_ anyway where you use them.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@714
PS2, Line 714:       num_unpacked_values_ = page_encoding_ == 
parquet::Encoding::PLAIN
nit: I find a simple if more readable than the ternary operator. Might be just 
my preference, though.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@139
PS2, Line 139:   int32_t GetValues(int32_t num_values_to_consume, T* values);
As I see the return value here serves 2 different purposes:
 1) As a general return value to indicate success or failure
 2) An out parameter to get the num_values_to_consume.
I find this a bit confusing. Can't you have a return value for the purpose of 
1) and an out param for 2) ?
What do you think?


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@666
PS2, Line 666:   int32_t num_unpacked_values = 0;
The names of num_unpacked_values and num_values_to_consume are not in line in 
my opinion. Either the first one should be num_consumed_values or the second 
one num_values_to_unpack.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@152
PS2, Line 152: rle_encoded_bool.parquet:
According to parquet-rle-encoded-bool.test there are 140 True values in column 
b. Could you please mention in the comment this? (and the total row count would 
also seem reasonable for future use).


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@154
PS2, Line 154: as I found no other way to force RLE
             : encoding of booleans in Parquet v1
nit: This part of the sentence doesn't provide extra information for the reader 
in my opinion.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@155
PS2, Line 155: It became the default encoding in Parquet v2, but
             : Impala is not able to read Parquet v2 files yet.
nit: I'm not sure this information is required here either. (what if someone 
reads this after Parquet v2 reading is implemented in Impala?)


http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649
PS2, Line 649:         "refresh {0}.rle_encoded_bool".format(unique_database));
For my understanding: Why is this refresh needed here? I checked other tests 
that include .parquet files (e.g. test_zero_rows) but those doesn't do this 
refresh step.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Mon, 05 Mar 2018 16:04:02 +0000
Gerrit-HasComments: Yes

Reply via email to