Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/11582 )
Change subject: IMPALA-6658: improve Parquet RLE for low bit widths ...................................................................... Patch Set 2: (7 comments) Thanks for the reviews http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11582/2//COMMIT_MSG@20 PS2, Line 20: By default RleEncoder will now use run length encoding for runs of : length 24 for single bit values, and of length 16 for 2 bit wide values. : All other bit widths will use the existing length 8 runs. > About the solution in general: I agree with your calculations. I think sometimes min_run_length=24 is better at bit width=1 I’m going to use the notation of R16 for a repeated run of length 16, and L16 for a literal run of length 16. So your example (bit width 1) is: repeated lengths: R24 R16 R24 R16 R24 min_run_length 8 (old) 2 2 2 2 2 min_run_length 24 (new) 2 3 2 3 2 min_run_length 16 2 2 2 2 2 The case where min_run_length=24 is better is where we avoid interrupting a literal run L24 R16 L24 R16 L24 Old solution 4 2 4 2 4 min_run_length 24 4 2 3 2 3 (one long literal run) min_run_length 16 4 2 4 2 4 The case with R8 interrupting a literal: L24 R8 L24 R8 L24 Old solution 4 2 4 2 4 min_run_length 24 4 1 3 1 3 (one long literal run) min_run_length 16 4 1 3 1 3 (one long literal run) The cases for bit_width = 2 are similar. My thinking originally was that the case of a run interrupting a literal would be more common than alternating runs. But I think you are right that I should only change things where there is always a win. I will use min_run_length=16 for bit width 1, and that will be the only case where min_run_length != 8. http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@203 PS2, Line 203: the previous run is flushed out > I think that it would be helpful to update this comment with a brief discus Done http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@211 PS2, Line 211: mu;tiple > typo: multiple Done http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@348 PS2, Line 348: most values > most MAX_RUN_LENGTH_BUFFER values Done http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@350 PS2, Line 350: _ > nit: period, here and elsewhere Done http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@475 PS2, Line 475: // updates num_buffered_values > This comment does not seam true here. I think it is true, even when update_indicator_byte = false. But the comment is bad, should be // updates num_buffered_values_ (I am not used to that trailing underscore) http://gerrit.cloudera.org:8080/#/c/11582/2/be/src/util/rle-encoding.h@549 PS2, Line 549: we > typo Done -- To view, visit http://gerrit.cloudera.org:8080/11582 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76 Gerrit-Change-Number: 11582 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Comment-Date: Mon, 22 Oct 2018 23:24:50 +0000 Gerrit-HasComments: Yes