Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 )
Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner ...................................................................... Patch Set 3: (7 comments) The overall approach looks good, had some comments about the details and the test coverage. http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@673 PS3, Line 673: switch(page_encoding_) { nit: space before ( http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@1304 PS3, Line 1304: stringstream ss; Convert to using Substitute() while you're here - we generally prefer using it over stringstream. http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@138 PS3, Line 138: /// Returns the number of consumed values or 0 if an error occurred. Can you add some tests to be/src/util/rle-test.cc for the new interface? There are already some tests there that exercise both GetSingleValue() and the NextNum*/Get*Values() interfaces, so we could also add a test for this variant of the interface. http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674 PS3, Line 674: memset(values + num_consumed, repeated_value, num_repeats_to_set); It's not valid to use memset() here in general since sizeof(T) may be > 1. The reason it worked for the Parquet level decoding was because they were always a uint8_t. In general we need a for loop here to do it correctly, but my understanding was that memset() is sometimes faster - it was used as an optimisation. I see two paths forward here: 1. maybe gcc converts the for loop to a memset() anyway when sizeof(T) is 1 bytes (at least some compilers can do this conversion if the code fits the right pattern). 2. do an if(sizeof(T) == 1) here so that we can use the optimized memset() path in that case, then fall back to a for loop otherwise. http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README@160 PS3, Line 160: Parquet v1 file with a single RLE encoded boolean column "b". Created for IMPALA-6324. How was it created? Parquet-mr? http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test File testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test: http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4 PS3, Line 4: select count(*) from rle_encoded_bool where b; This wouldn't detect if the boolean values were returned in the wrong order. E.g. if we somehow got the bit-packing backwards and the values were in the wrong order (we had a long-standing bug with the BIT_PACKED levels). Can we also add a test with two columns (e.g. id and boolean) and make sure that they match up correctly. http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py@656 PS3, Line 656: """ Nit: trailing """ can go on same line. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Tue, 06 Mar 2018 23:20:10 +0000 Gerrit-HasComments: Yes