This is an automated email from the ASF dual-hosted git repository. gangwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
     new 2a481fe  PARQUET-2222: Fix incorrect spec for RLE encoding of data page v2
2a481fe is described below

commit 2a481fe1aad64ff770e21734533bb7ef5a057dac
Author: Gang Wu <ust...@gmail.com>
AuthorDate: Fri Mar 24 17:52:09 2023 +0800

    PARQUET-2222: Fix incorrect spec for RLE encoding of data page v2

    This closes #193
---
 Encodings.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Encodings.md b/Encodings.md
index a70ae6f..5e38d48 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -68,6 +68,7 @@ This encoding uses a combination of bit-packing and run length encoding to more
 The grammar for this encoding looks like this, given a fixed bit-width known in advance:
 ```
 rle-bit-packed-hybrid: <length> <encoded-data>
+// length is not always prepended, please check the table below for more detail
 length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)
 encoded-data := <run>*
 run := <bit-packed-run> | <rle-run>
@@ -123,6 +124,23 @@ data:
 * Dictionary indices
 * Boolean values in data pages, as an alternative to PLAIN encoding
 
+Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below:
+```
++--------------+------------------------+-----------------+
+| Page kind    | RLE-encoded data kind  | Prepend length? |
++--------------+------------------------+-----------------+
+| Data page v1 | Definition levels      | Y               |
+|              | Repetition levels      | Y               |
+|              | Dictionary indices     | N               |
+|              | Boolean values         | Y               |
++--------------+------------------------+-----------------+
+| Data page v2 | Definition levels      | N               |
+|              | Repetition levels      | N               |
+|              | Dictionary indices     | N               |
+|              | Boolean values         | Y               |
++--------------+------------------------+-----------------+
+```
+
 ### <a name="BITPACKED"></a>Bit-packed (Deprecated) (BIT_PACKED = 4)
 This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.