Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
......................................................................


Patch Set 6:

(4 comments)

Only very minor comments. I'm curious what was wrong with the decoding - I 
didn't see an obvious bugfix between PS3 and this patchset.

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG@17
PS6, Line 17: filled. As the cache size is 128, I considered this to be outside 
the
I agree that this overhead seems minimal - I'm not concerned about a 
predictable branch per 128 values.


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc
File be/src/benchmarks/rle-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39
PS4, Line 39: //           for loop / run length: 10                0.4      
0.4      0.4     0.487X     0.487X     0.486X
            : //             memset / run length: 10              0.396      
0.4      0.4     0.482X     0.487X     0.486X
> Some thoughts about this performance degradation compared to the run_length
I agree that it would make sense to pick the denser encoding in cases where the 
literal run is actually smaller. Please feel free to file a JIRA and fix it.

I'm not sure that the performance difference will hold up permanently, since we 
can optimise handling of repeated runs better in the caller and avoid unpacking 
the N values.


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@60
PS4, Line 60: std::
nit: we generally prefer "using" declarations in .cc files instead of using the 
namespace everywhere.


http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h@611
PS6, Line 611:     literal_count_ = (indicator_value >> 1) * 8;
I remember you pointed out on another patchset that, according to the grammar, 
this could technically be 0 and still be a valid literal run. Should we leave a 
comment explaining that while we're here?

https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Mar 2018 18:55:41 +0000
Gerrit-HasComments: Yes

Reply via email to