[ 
https://issues.apache.org/jira/browse/PARQUET-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730904#comment-17730904
 ] 

Gabor Szadovszky commented on PARQUET-2222:
-------------------------------------------

[~apitrou], [~wgtmac],

It seems my review was not deep enough; sorry about that. For boolean values, 
parquet-mr does not use RLE encoding in V1, only bit packing: 
* [V1|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L53] -> ... -> [Bit packing|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ByteBitPackingValuesWriter.java] (encoding written to page header: PLAIN)
* [V2|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L57] -> ... -> [RLE|https://github.com/apache/parquet-mr/blob/9d80330ae4948787ac0bf4e4b0d990917f106440/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridValuesWriter.java] (encoding written to page header: RLE)
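
To make the V1 case concrete, here is a minimal sketch of 1-bit packing for booleans. It is not the parquet-mr code; the function names are made up for illustration, and it assumes the LSB-first bit order that the format describes for PLAIN booleans.

```python
def pack_booleans(values):
    """Pack booleans into bytes, one bit per value, LSB first (assumed order)."""
    out = bytearray((len(values) + 7) // 8)  # round up to whole bytes
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_booleans(data, count):
    """Inverse of pack_booleans; count is needed because padding bits are ambiguous."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]
```

Note that no length or run header is written here, which is the difference from the RLE/bit-packing hybrid used in V2.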

[~apitrou], could you please confirm that it is the same for parquet-cpp?

So the table we added in this PR about prepending the length is misleading. 
Also, the link in the PLAIN encoding section for boolean is dead and 
misleading; it should point to BIT_PACKED. The definition of BIT_PACKED also 
wrongly states that it is valid only for RL/DL (repetition and definition 
levels). I think the deprecation is still valid, since "BIT_PACKED" should not 
be written anywhere as an encoding, although the bit-packing scheme itself is 
what is used under PLAIN for boolean.
Would you guys like to work on this? We probably want to include it in the 
current format release.

> [Format] RLE encoding spec incorrect for v2 data pages
> ------------------------------------------------------
>
>                 Key: PARQUET-2222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Antoine Pitrou
>            Assignee: Gang Wu
>            Priority: Critical
>             Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.
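
For reference, a rough sketch of what the quoted grammar means for a reader. Whether the 4-byte prefix is present is exactly what this issue is about, so the sketch takes it as a caller-supplied flag rather than deciding it; the function name and parameter are hypothetical and not taken from any Parquet implementation.

```python
import struct


def split_rle_payload(buf, has_length_prefix):
    """Return (encoded_data, remaining_bytes) for an RLE/bit-packed hybrid run.

    has_length_prefix is an assumption supplied by the caller; this sketch
    does not decide which page versions carry the prefix.
    """
    if not has_length_prefix:
        # No prefix: the whole buffer is the encoded data.
        return buf, b""
    if len(buf) < 4:
        raise ValueError("buffer too short for a 4-byte length prefix")
    # <length>: 4 bytes, little endian, unsigned int32
    (length,) = struct.unpack_from("<I", buf, 0)
    end = 4 + length
    if end > len(buf):
        raise ValueError("length prefix exceeds buffer size")
    return buf[4:end], buf[end:]
```

A reader that assumes the prefix where none was written would interpret the first four bytes of encoded data as a length, which is why the spec wording matters.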



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
