Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 )
Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner ...................................................................... Patch Set 2: (10 comments) http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG@24 PS2, Line 24: query into it nit: query from it? http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@185 PS2, Line 185: return num_cached_levels > 0; instead? return num_cached_levels != NULL num_cached_levels > 0 doesn't indicate for me that num_cached_levels is a pointer. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@665 PS2, Line 665: switch(page_encoding_) { Would it have any impact if you did a reset on both bool_values_ and rle_decoder_ here unconditionally? If this has to stay, could you enhance the function's comment to reflect? As I see you branch on page_encoding_ anyway where you use them. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@714 PS2, Line 714: num_unpacked_values_ = page_encoding_ == parquet::Encoding::PLAIN nit: I find a simple if more readable than the ternary operator. Might be just my preference, though. http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@139 PS2, Line 139: int32_t GetValues(int32_t num_values_to_consume, T* values); As I see the return value here serves 2 different purposes: 1) As a general return value to indicate success or failure 2) An out parameter to get the num_values_to_consume. I find this a bit confusing. Can't you have a return value for the purpose of 1) and an out param for 2) ? What do you think? http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@666 PS2, Line 666: int32_t num_unpacked_values = 0; The names of num_unpacked_values and num_values_to_consume are not in line in my opinion. Either the first one should be num_consumed_values or the second one num_values_to_unpack. http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@152 PS2, Line 152: rle_encoded_bool.parquet: According to parquet-rle-encoded-bool.test there are 140 True values in column b. Could you please mention in the comment this? (and the total row count would also seem reasonable for future use). http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@154 PS2, Line 154: as I found no other way to force RLE : encoding of booleans in Parquet v1 nit: This part of the sentence doesn't provide extra information for the reader in my opinion. http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@155 PS2, Line 155: It became the default encoding in Parquet v2, but : Impala is not able to read Parquet v2 files yet. nit: I'm not sure this information is required here either. (what if someone reads this after Parquet v2 reading is implemented in Impala?) http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649 PS2, Line 649: "refresh {0}.rle_encoded_bool".format(unique_database)); For my understanding: Why is this refresh needed here? I checked other tests that include .parquet files (e.g. test_zero_rows) but those doesn't do this refresh step. -- To view, visit http://gerrit.cloudera.org:8080/9403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Gerrit-Change-Number: 9403 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Mon, 05 Mar 2018 16:04:02 +0000 Gerrit-HasComments: Yes
