[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Csaba Ringhofer (Code Review) Mon, 05 Mar 2018 11:21:23 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )


Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
......................................................................


Patch Set 2:

(10 comments)

Sorry, I forgot that I have already rebased locally, so the rebase and the 
answers for Gabor's comments are mixed in Patch set 2.

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG@24
PS2, Line 24: query into it
> nit: query from it?
Done


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@185
PS2, Line 185:   return num_cached_levels > 0;
> instead?
Ouch, this was a bug, thanks for spotting it!


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@665
PS2, Line 665:     switch(page_encoding_) {
> Would it have any impact if you did a reset on both bool_values_ and rle_de
I have added a comment about this.
Initializing both would not be better in my opinion, as I see no "sane" way to 
initialize the "other" encoder. From a code readability standpoint, the best 
would be to create a common virtual interface for these two decoders, and 
construct a proper implementation on every new data page, but this could cause 
performance regression.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@714
PS2, Line 714:       num_unpacked_values_ = page_encoding_ == 
parquet::Encoding::PLAIN
> nit: I find a simple if more readable than the ternary operator. Might be j
Done


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@139
PS2, Line 139:   int32_t GetValues(int32_t num_values_to_consume, T* values);
> As I see the return value here serves 2 different purposes:
I see your point, but I would like to keep it this way for now - the original 
idea was to keep the signature similar to BatchedBitReader::UnpackBatch().


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@666
PS2, Line 666:   int32_t num_unpacked_values = 0;
> The names of num_unpacked_values and num_values_to_consume are not in line
I have replaced num_unpacked_values with num_consumed, as that name is used for 
the same role in several functions in this file.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@152
PS2, Line 152: rle_encoded_bool.parquet:
> According to parquet-rle-encoded-bool.test there are 140 True values in col
Done


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@154
PS2, Line 154: as I found no other way to force RLE
             : encoding of booleans in Parquet v1
> nit: This part of the sentence doesn't provide extra information for the re
Agree, the same information is in the commit message.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@155
PS2, Line 155: It became the default encoding in Parquet v2, but
             : Impala is not able to read Parquet v2 files yet.
> nit: I'm not sure this information is required here either. (what if someon
Done


http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649
PS2, Line 649:         "refresh {0}.rle_encoded_bool".format(unique_database));
> For my understanding: Why is this refresh needed here? I checked other test
I have copy-pasted the template of this functions from test_def_levels, which 
uses REFRESH for some reason. :)
In these tests REFRESH is not necessary, because there is no SELECT before the 
file copy, so impalad does not have a cached list of underlying files. 
Generally REFRESH is needed to recognize files added via HDFS copy or Hive, but 
I agree that it is better to skip it in these tests to make them faster.

Note that most of these tests duplicate some common steps (create table and 
copy data into its HDFS dir from testdata/data). It could be nice create a 
functions for these steps and do some refactor.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Mon, 05 Mar 2018 19:13:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Reply via email to