Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc
File be/src/benchmarks/rle-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39
PS4, Line 39: // for loop / run length: 10     0.4    0.4    0.4    0.487X    0.487X    0.486X
            : // memset / run length: 10     0.396    0.4    0.4    0.482X    0.487X    0.486X
Some thoughts about this performance degradation compared to the run_length=1 case:

If bit_width=1, then 8 values encoded as a repeated run can use more space than if they were encoded as a literal run. RleEncoder currently always uses repeated runs if it finds 8 repeated values - it may be good to change this to a higher number (16?) if bit_width=1.

The performance seems to be similar if bit_width>1, so the space inefficiency above is probably not the real cause.

I suspect that the problem with short repeated runs is that BatchedBitReader is optimized for 32*N batches, and shorter literal runs are buffered by RleBatchDecoder. It would be possible to avoid buffering in case of 8*N batches too, which could improve the performance of shorter runs.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 13 Mar 2018 15:14:53 +0000
Gerrit-HasComments: Yes
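
The space argument in the comment above (8 repeated values at bit_width=1 costing more as a repeated run than as part of a literal run) can be checked with a small size model. This is only a sketch of the Parquet RLE / bit-packed hybrid layout as described in the Parquet format documentation; it is not Impala's RleEncoder, and the helper names VarintLen, RepeatedRunBytes and LiteralRunBytes are hypothetical, introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to store v as an unsigned LEB128 varint, the encoding used
// for run headers in the Parquet RLE / bit-packed hybrid format.
inline std::size_t VarintLen(std::uint64_t v) {
  std::size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// Size of one repeated run: header varint(run_length << 1), followed by the
// repeated value padded up to a whole number of bytes.
inline std::size_t RepeatedRunBytes(std::uint64_t run_length, int bit_width) {
  assert(run_length > 0 && bit_width > 0 && bit_width <= 32);
  return VarintLen(run_length << 1) + static_cast<std::size_t>((bit_width + 7) / 8);
}

// Size of one literal (bit-packed) run: header varint((groups << 1) | 1),
// followed by groups * bit_width bytes, where each group holds 8 values.
inline std::size_t LiteralRunBytes(std::uint64_t num_values, int bit_width) {
  assert(num_values > 0 && num_values % 8 == 0);
  assert(bit_width > 0 && bit_width <= 32);
  const std::uint64_t groups = num_values / 8;
  return VarintLen((groups << 1) | 1) +
         static_cast<std::size_t>(groups * static_cast<std::uint64_t>(bit_width));
}
```

Under this model, take 24 values with bit_width=1 where only the middle 8 are equal. Kept as a single literal run they cost LiteralRunBytes(24, 1) = 4 bytes. Splitting out the middle 8 as a repeated run costs LiteralRunBytes(8, 1) + RepeatedRunBytes(8, 1) + LiteralRunBytes(8, 1) = 6 bytes, because each extra run pays for its own header. At bit_width=8 the same split is cheaper (20 bytes versus 25), which is consistent with the space concern being specific to bit_width=1.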
