[ 
https://issues.apache.org/jira/browse/PARQUET-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217399#comment-15217399
 ] 

Fabrizio Milo commented on PARQUET-575:
---------------------------------------

OK, that works for the definition levels inside the data page. But what about 
the plainDictionaryDecoder? 
https://github.com/apache/parquet-cpp/blob/master/src/parquet/encodings/dictionary-encoding.h#L61

It seems the RLE decoder is used directly there, without the initial <length> 
prefix before the <encoded-data>.

Dumping the contents of the test data files suggests that this is the case, 
unless I am missing something.

Here is the data page for smallint_col from the 
testdata/alltypes_plain.parquet file:

```
page size: 9  bytes
dataPage: INT32 PLAIN_DICTIONARY 8 OPTIONAL
0000: 02 00 00 00
0004: 10 01 01 03
0008: aa
```
This is the entire 9-byte page. The first 4 bytes are the length (2), followed 
by the two bytes of the RLE run for the definition levels. 
At this point the documentation 
(https://github.com/Parquet/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2)
states that:

    Data page format: the bit width used to encode the entry ids stored as 1
    byte (max bit width = 32), followed by the values encoded using RLE/Bit
    packed described above (with the given bit width).

So I would expect another 4 bytes with the length, followed by the RLE runs, 
but there is not enough data left for that, only for a series of RLE runs. 

Is the documentation wrong, and does the length prefix apply only to level 
encodings? 
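
To make the question concrete, here is a minimal Python sketch of one reading 
of the 9 bytes above: a 4-byte length prefix on the definition levels only, 
then a 1-byte bit width, then hybrid runs with no length prefix. The helper 
names (read_varint, decode_hybrid) are made up for illustration and are not 
parquet-cpp functions. It also assumes that the 8 in the dump header is the 
value count and that the definition-level bit width is 1 (flat OPTIONAL 
column).

```python
import struct


def read_varint(buf, pos):
    """Read an unsigned LEB128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_hybrid(buf, bit_width, num_values):
    """Decode RLE/bit-packed hybrid runs that carry NO length prefix."""
    values, pos = [], 0
    byte_width = (bit_width + 7) // 8
    mask = (1 << bit_width) - 1
    while len(values) < num_values and pos < len(buf):
        header, pos = read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, LSB first.
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                values.append((bits >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + byte_width], "little")
            pos += byte_width
            values.extend([value] * count)
    return values[:num_values]


# The 9 bytes from the dump above.
page = bytes.fromhex("02000000" "1001" "01" "03aa")
num_values = 8  # assumption: the "8" in the dump header is the value count

# Definition levels: 4-byte little-endian length, then the runs.
(level_len,) = struct.unpack_from("<I", page, 0)
def_levels = decode_hybrid(page[4:4 + level_len], 1, num_values)

# Dictionary indices: 1-byte bit width, then runs with no length prefix.
rest = page[4 + level_len:]
index_bit_width = rest[0]
num_non_null = sum(1 for level in def_levels if level == 1)
indices = decode_hybrid(rest[1:], index_bit_width, num_non_null)

print(def_levels)  # [1, 1, 1, 1, 1, 1, 1, 1]
print(indices)     # [0, 1, 0, 1, 0, 1, 0, 1]
```

Under these assumptions all 9 bytes are consumed and the indices alternate 
between two dictionary entries, which is what I would expect if the length 
prefix applies only to the levels.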


 

> Different RLE Encoding Specification 
> -------------------------------------
>
>                 Key: PARQUET-575
>                 URL: https://issues.apache.org/jira/browse/PARQUET-575
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Fabrizio Milo
>            Priority: Trivial
>
> The parquet-format specification 
> https://github.com/Parquet/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
> states that the RLE encoding starts with 
> ```
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little endian
> ```
> while the cpp implementation has this description 
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L42
> and the implementation seems to follow that description, 
> which does not include the initial <length> before <encoded-data>: 
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/rle-encoding.h#L272
> So which one is correct? It seems that parquet-format is the wrong one.
> DataPage.definitionLevels uses RLE, and none of the example format files seem 
> to have that initial <length> before <encoded-data>. 
> Also, the use of both names, `literal` and `bit-encoding`, is confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
