Chao Sun created PARQUET-1249:
---------------------------------
Summary: Clarify encoding schemes for boolean types
Key: PARQUET-1249
URL: https://issues.apache.org/jira/browse/PARQUET-1249
Project: Parquet
Issue Type: Improvement
Components: parquet-format
Reporter: Chao Sun
In the Parquet format specification, under [the section for Plain
encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#plain-plain--0],
boolean is encoded using the deprecated bit-packed encoding. However, [the
section for bit-packed
encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4]
specifies that it is only used for repetition/definition levels. This seems
contradictory.
[The section for RLE/bit-packed hybrid
encoding|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3]
says "_Boolean values in data pages, as an alternative to PLAIN encoding_" -
perhaps we should state explicitly that this applies only to data page V2?
Also, on the implementation side, parquet-cpp still encodes booleans as plain
1-bit values, while parquet-mr uses the bit-packed encoding described in the
specification. Perhaps the two implementations should be made consistent.
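
To make the ambiguity concrete, here is a minimal sketch of two ways to pack booleans at 1 bit per value. It assumes the reading of Encodings.md in which PLAIN booleans are packed LSB first and the deprecated BIT_PACKED encoding packs from the most significant bit of each byte. The function names are hypothetical and are not taken from parquet-cpp or parquet-mr.

```python
def pack_bools_lsb_first(values):
    """Pack booleans 1 bit each, filling every byte from bit 0 upward.

    Assumed to match the PLAIN layout for BOOLEAN (bit packed, LSB first).
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def pack_bools_msb_first(values):
    """Pack booleans 1 bit each, filling every byte from bit 7 downward.

    Assumed to match the deprecated BIT_PACKED layout at bit width 1.
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


# Nine values: only the first and the last are True.
values = [True] + [False] * 7 + [True]
lsb = pack_bools_lsb_first(values)
msb = pack_bools_msb_first(values)
print(lsb.hex(), msb.hex())
```

Under these assumptions the same nine values produce different bytes depending on the bit order, so a reader and a writer that interpret "bit-packed" differently would disagree on the data.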