Andrew Sherman has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11582 )

Change subject: IMPALA-6658: improve Parquet RLE for low bit widths
......................................................................


Patch Set 2:

(7 comments)

Thanks for the reviews

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20
PS2, Line 20: By default RleEncoder will now use run length encoding for runs of
            : length 24 for single bit values, and of length 16 for 2 bit wide 
values.
            : All other bit widths will use the existing length 8 runs.
> About the solution in general:
I agree with your calculations. I think sometimes min_run_length=24 is better 
at bit width=1
I’m going to use the notation of R16 for a repeated run of length 16, and L16 
for a literal run of length 16.
So your example (bit width 1) is:
repeated lengths:               R24 R16 R24 R16 R24
min_run_length 8 (old)          2   2   2   2   2
min_run_length 24 (new)         2   3   2   3   2
min_run_length 16               2   2   2   2   2

The case where min_run_length=24 is better is where we avoid interrupting a 
literal run
                                L24 R16 L24 R16 L24
Old solution                    4   2   4   2   4
min_run_length 24               4   2   3   2   3 (one long literal run)
min_run_length 16               4   2   4   2   4

The case with R8 interrupting a literal:
                                L24 R8  L24 R8  L24
Old solution                    4   2   4   2   4
min_run_length 24               4   1   3   1   3 (one long literal run)
min_run_length 16               4   1   3   1   3 (one long literal run)

The cases for bit_width = 2 are similar.

My thinking originally was that the case of a run interrupting a literal would 
be more common than alternating runs.
But I think you are right that I should only change things where there is 
always a win.
I will use min_run_length=16 for bit width 1, and that will be the only case 
where min_run_length != 8.


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203
PS2, Line 203: the previous run is flushed out
> I think that it would be helpful to update this comment with a brief discus
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@211
PS2, Line 211: mu;tiple
> typo: multiple
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@348
PS2, Line 348: most values
> most MAX_RUN_LENGTH_BUFFER values
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@350
PS2, Line 350: _
> nit: period, here and elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@475
PS2, Line 475: // updates num_buffered_values
> This comment does not seam true here.
I think it is true, even when update_indicator_byte = false.
But the comment is bad, should be
// updates num_buffered_values_
(I am not used to that trailing underscore)


http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@549
PS2, Line 549: we
> typo
Done



--
To view, visit http://gerrit.cloudera.org:8080/11582
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Gerrit-Change-Number: 11582
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Comment-Date: Mon, 22 Oct 2018 23:24:50 +0000
Gerrit-HasComments: Yes

Reply via email to