[ https://issues.apache.org/jira/browse/PARQUET-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218046#comment-15218046 ]
Deepak Majeti commented on PARQUET-575:
---------------------------------------
Decoding the RLE/bit-packed hybrid requires three values: rle_buffer, rle_length, and bit_width.
For levels and dictionary data, one of these values is derived from context instead of being stored.
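To show why all three values are needed, here is a minimal Python sketch of a hybrid decoder. It follows the run layout described in the parquet-format Encodings document: a varint header whose low bit selects an RLE run or a bit-packed run. The name decode_hybrid and the num_values parameter are hypothetical and are not taken from parquet-cpp; rle_length is simply len(rle_buffer) here.

```python
def decode_hybrid(rle_buffer: bytes, bit_width: int, num_values: int) -> list:
    """Decode up to num_values integers from an RLE/bit-packed hybrid buffer."""
    out = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(out) < num_values and pos < len(rle_buffer):
        # Read the run header, stored as an unsigned LEB128 varint.
        header = 0
        shift = 0
        while True:
            byte = rle_buffer[pos]
            pos += 1
            header |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(rle_buffer[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                out.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value, padded to whole bytes, repeated (header >> 1) times.
            count = header >> 1
            width_bytes = (bit_width + 7) // 8
            value = int.from_bytes(rle_buffer[pos:pos + width_bytes], "little")
            pos += width_bytes
            out.extend([value] * count)
    return out[:num_values]


# Bit-packed example from the format document: values 0..7 at bit_width 3.
print(decode_hybrid(b"\x03\x88\xc6\xfa", 3, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Without bit_width the decoder cannot size either kind of run, and without rle_length it cannot tell where the encoded data ends.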
In the case of levels, the bit_width is derived from the maximum
definition/repetition level, and only the rle_length followed by the rle_buffer is
stored explicitly.
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/levels.h#L109
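The level layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the parquet-cpp code: the helper names are hypothetical, the length prefix is read as the 4-byte little-endian integer given in the parquet-format document, and the bit width is taken to be the smallest width that can hold every level from 0 to max_level.

```python
import struct


def bit_width_for_max_level(max_level: int) -> int:
    # Smallest number of bits that can represent every value in [0, max_level].
    return max_level.bit_length()


def split_level_data(page: bytes, max_level: int):
    """Return (bit_width, rle_buffer, remaining_bytes) for one block of levels."""
    bit_width = bit_width_for_max_level(max_level)
    # rle_length is stored explicitly as a 4-byte little-endian prefix.
    (rle_length,) = struct.unpack_from("<I", page, 0)
    rle_buffer = page[4:4 + rle_length]
    if len(rle_buffer) != rle_length:
        raise ValueError("truncated level data")
    return bit_width, rle_buffer, page[4 + rle_length:]


# Two bytes of encoded levels followed by unrelated page bytes.
page = struct.pack("<I", 2) + b"\x06\x01" + b"rest"
print(split_level_data(page, max_level=1))  # (1, b'\x06\x01', b'rest')
```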
In the case of dictionary pages, the rle_length is the size of the dictionary
page itself and only the bit_width followed by the rle_buffer is stored
explicitly.
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/reader.cc#L55
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h#L55
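The dictionary case can be sketched the same way. Again this is only an illustration: split_dictionary_indices is a hypothetical helper, and it assumes the bit_width occupies a single leading byte and that everything after it, up to the end of the page data, is the rle_buffer.

```python
def split_dictionary_indices(page: bytes):
    """Return (bit_width, rle_length, rle_buffer) for dictionary-encoded data."""
    if not page:
        raise ValueError("empty page")
    # bit_width is stored explicitly (assumed here to be one leading byte).
    bit_width = page[0]
    # No length prefix: the encoded data runs to the end of the page.
    rle_buffer = page[1:]
    rle_length = len(rle_buffer)
    return bit_width, rle_length, rle_buffer


print(split_dictionary_indices(b"\x03\x06\x05"))  # (3, 2, b'\x06\x05')
```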
> Different RLE Encoding Specification
> -------------------------------------
>
> Key: PARQUET-575
> URL: https://issues.apache.org/jira/browse/PARQUET-575
> Project: Parquet
> Issue Type: Improvement
> Reporter: Fabrizio Milo
> Priority: Trivial
>
> In the parquet-format specification
> https://github.com/Parquet/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> it is written that the RLE encoding starts with
> ```
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little endian
> ```
> while in the cpp implementation there is this description
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L42
> and the implementation seems to follow that specification
> which does not include the initial <length> <encoded-data>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L272
> So which one is correct? It seems that the parquet-format specification is the wrong one.
> DataPage.definitionLevels uses RLE and none of the example format files seem
> to have that initial <length> <encoded-data>
> Also the use of both names `literal` and `bit-encoding` is confusing.