Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
......................................................................


Patch Set 3:

(7 comments)

The overall approach looks good, had some comments about the details and the 
test coverage.

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@673
PS3, Line 673:     switch(page_encoding_) {
nit: space before (


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@1304
PS3, Line 1304:   stringstream ss;
Convert to using Substitute() while you're here - we generally prefer using it 
over stringstream.


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@138
PS3, Line 138:   /// Returns the number of consumed values or 0 if an error 
occurred.
Can you add some tests to be/src/util/rle-test.cc for the new interface? There 
are already some tests there that exercise both GetSingleValue() and the 
NextNum*/Get*Values() interfaces, so we could also add a test for this variant 
of the interface.


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674
PS3, Line 674:        memset(values + num_consumed, repeated_value, 
num_repeats_to_set);
It's not valid to use memset() here in general since sizeof(T) may be > 1. The 
reason it worked for the Parquet level decoding was because they were always a 
uint8_t.

In general we need a for loop here to do it correctly, but my understanding was 
that memset() is sometimes faster - it was used as an optimisation. I see two 
paths forward here:
1. maybe gcc converts the for loop to a memset() anyway when sizeof(T) is 1 
bytes (at least some compilers can do this conversion if the code fits the 
right pattern).
2. do an if(sizeof(T) == 1) here so that we can use the optimized memset() path 
in that case, then fall back to a for loop otherwise.


http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README@160
PS3, Line 160: Parquet v1 file with a single RLE encoded boolean column "b". 
Created for IMPALA-6324.
How was it created? Parquet-mr?


http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test:

http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4
PS3, Line 4: select count(*) from rle_encoded_bool where b;
This wouldn't detect if the boolean values were returned in the wrong order. 
E.g. if we somehow got the bit-packing backwards and the values were in the 
wrong order (we had a long-standing bug with the BIT_PACKED levels).

Can we also add a test with two columns (e.g. id and boolean) and make sure 
that they match up correctly.


http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py@656
PS3, Line 656:     """
Nit: trailing """ can go on same line.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 06 Mar 2018 23:20:10 +0000
Gerrit-HasComments: Yes

Reply via email to