[jira] [Commented] (IMPALA-6658) Parquet RLE encoding can waste space with small repeated runs

ASF subversion and git services (JIRA) Wed, 14 Nov 2018 17:02:14 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687341#comment-16687341
 ]


ASF subversion and git services commented on IMPALA-6658:
---------------------------------------------------------

Commit d031bf82467ce5e046f464295c8ac6d6804fc196 in impala's branch 
refs/heads/master from [~asherman]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=d031bf8 ]

IMPALA-6658: improve Parquet RLE for low bit widths

RleEncoder buffers values in its own cache to detect run lengths that
can be efficiently encoded. When a run is detected it is written with an
indicator byte which encodes the length of the run. So an encoded
run always has an overhead of at least one byte. This means that for
single bit values, encoding 8 values as a run is inefficient.
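
The arithmetic behind this can be sketched as follows. This is a minimal illustration, assuming the layout in the Parquet Encodings.md specification (a varint header holding the run length shifted left by one, followed by the repeated value padded to whole bytes). The function names are hypothetical and are not taken from RleEncoder.

```python
def varint_len(n):
    """Bytes needed for the varint (ULEB128) encoding of a non-negative int."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def rle_run_bytes(bit_width, run_length):
    """Size of one repeated run: varint header plus one byte-padded value."""
    header = varint_len(run_length << 1)
    value = (bit_width + 7) // 8
    return header + value


def bit_packed_data_bytes(bit_width, num_values):
    """Size of bit-packed data only (header excluded); groups hold 8 values."""
    groups = (num_values + 7) // 8
    return groups * bit_width


# For single-bit values, compare a repeated run with bit-packing the same values.
for run_length in (8, 16):
    print(run_length,
          rle_run_bytes(1, run_length),
          bit_packed_data_bytes(1, run_length))
```

Under these assumptions a run of 8 single-bit values costs 2 bytes against 1 byte of bit-packed data, and the two only break even at 16 values. The bit-packed figure leaves out the bit-packed run header, which is shared by all groups in that run.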

Change RleEncoder to have the ability to use run lengths other than 8.
A new parameter to the constructor (min_run_length) allows test callers
(only) to set the minimum run length.

By default RleEncoder will now use run length encoding for runs of
length 16 for single bit values. All other bit widths will use the
existing length 8 runs.
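
The effect of the threshold can be illustrated with a simplified run splitter. This is only a sketch: split_runs and default_min_run_length are hypothetical names, the splitter works on whole Python lists rather than a bounded buffer, and it ignores the Parquet requirement that bit-packed runs hold a multiple of 8 values.

```python
from itertools import groupby


def default_min_run_length(bit_width):
    # Defaults as described in the commit message: 16 for 1-bit values, else 8.
    return 16 if bit_width == 1 else 8


def split_runs(values, min_run_length):
    """Split values into ("rle", value, count) and ("packed", [values]) segments."""
    segments = []
    literals = []  # values waiting to be emitted as one bit-packed segment
    for value, group in groupby(values):
        count = len(list(group))
        if count >= min_run_length:
            if literals:
                segments.append(("packed", literals))
                literals = []
            segments.append(("rle", value, count))
        else:
            # Too short to be worth a repeated run; keep it with the literals.
            literals.extend([value] * count)
    if literals:
        segments.append(("packed", literals))
    return segments


values = [1] * 8 + [0, 1, 0, 1, 0, 1, 0, 1] + [0] * 16
print(split_runs(values, 8))
print(split_runs(values, default_min_run_length(1)))
```

With a threshold of 8 the leading eight 1s become their own repeated run. With the single-bit default of 16 they are folded into the neighbouring bit-packed segment, and only the trailing sixteen 0s are run-length encoded.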

Internally RleEncoder must buffer more values so that the longer runs
can be detected. The internal buffer “buffered_values_” is larger
and is now a circular buffer so that the first 8 bytes of the buffer can
be separately flushed to BitWriter.
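
A circular buffer of this kind might look roughly like the sketch below. It is not the actual buffered_values_ implementation: the ValueRing class, its capacity of 16 and its method names are assumptions made for illustration. It shows how the oldest 8 entries can be flushed while newer ones stay buffered.

```python
class ValueRing:
    """Fixed-capacity circular buffer; a hypothetical stand-in for illustration."""

    def __init__(self, capacity):
        self._slots = [0] * capacity
        self._head = 0    # index of the oldest buffered value
        self._count = 0   # number of values currently buffered

    def __len__(self):
        return self._count

    def push(self, value):
        if self._count == len(self._slots):
            raise OverflowError("ring is full; flush before pushing")
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = value
        self._count += 1

    def flush_oldest(self, n):
        """Remove and return the n oldest values, in insertion order."""
        if n > self._count:
            raise ValueError("not enough buffered values")
        size = len(self._slots)
        out = [self._slots[(self._head + i) % size] for i in range(n)]
        self._head = (self._head + n) % size
        self._count -= n
        return out


ring = ValueRing(16)
for v in range(12):
    ring.push(v)
first_group = ring.flush_oldest(8)   # oldest 8 values leave the buffer
for v in range(12, 20):
    ring.push(v)                     # writes wrap around to the freed slots
print(first_group, len(ring))
```

Because the head index advances instead of the contents being shifted, flushing one group of 8 leaves the remaining values in place for further run detection.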

Testing:

All end-to-end and unit tests pass

The unit test rle-test is enhanced to run all tests against RleEncoders
using all possible values of min_run_length. In addition, rle-test is
refactored so that the Rle tests are in a class that inherits from
::testing::Test so that a SetUp() method can be used.
The Overflow test is enhanced to be more exhaustive (while still
completing in a second or two).

Change-Id: I191a581d3f699b6669e48ac9dc39c76ed77c4a76
Reviewed-on: http://gerrit.cloudera.org:8080/11582
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Parquet RLE encoding can waste space with small repeated runs
> -------------------------------------------------------------
>
>                 Key: IMPALA-6658
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6658
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Andrew Sherman
>            Priority: Minor
>              Labels: parquet, ramp-up
>
> Currently RleEncoder creates repeated runs from 8 repeated values, which can 
> be less space efficient than bit-packed if bit width is 1 or 2. In the worst 
> case, the whole data page can be ~2X larger if bit width is 1, and ~1.25X 
> larger if bit width is 2, compared to bit-packing.
> A comment in rle-encoding.h gives different numbers, but it probably does 
> not calculate with the overhead of splitting long runs into smaller ones 
> (every run adds +1 byte for its length): 
> [https://github.com/apache/impala/blob/8079cd9d2a87051f81a41910b74fab15e35f36ea/be/src/util/rle-encoding.h#L62]
> Note that if the data page is compressed, this size difference probably 
> disappears, but the larger uncompressed buffer size can still affect 
> performance.
> Parquet RLE encoding is described here: 
> [https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding-bit-packing-hybrid-rle-3]
>  
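
The ~2X and ~1.25X figures in the description can be reproduced with a short calculation. The pattern below is an assumed reconstruction of the worst case, not something stated in the issue: a repeated run of 8 values alternating with one bit-packed group of 8 non-repeating values, each with a one-byte header, compared against bit-packing all 16 values while ignoring the shared bit-packed header.

```python
def worst_case_ratio(bit_width):
    """Hybrid size over pure bit-packed size for the assumed worst-case pattern."""
    value_bytes = (bit_width + 7) // 8   # repeated value, padded to whole bytes
    rle_run = 1 + value_bytes            # 1-byte header + value, for 8 repeats
    packed_group = 1 + bit_width         # 1-byte header + 8 bit-packed values
    hybrid = rle_run + packed_group      # 16 values under the alternating pattern
    pure_packed = 2 * bit_width          # same 16 values, header amortised away
    return hybrid / pure_packed


for width in (1, 2, 3):
    print(width, worst_case_ratio(width))
```

Under these assumptions the ratio is 2.0 for bit width 1 and 1.25 for bit width 2, and it reaches 1.0 at bit width 3, which is consistent with the issue singling out bit widths 1 and 2.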



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

