Repository: parquet-format
Updated Branches:
  refs/heads/master c6d306daa -> 2696f9e0a

PARQUET-1171: Clarify scope of usage for RLE, BIT_PACKED encodings

See related discussions on mailing list, JIRA

Author: Wes McKinney <[email protected]>

Closes #79 from wesm/PARQUET-1171 and squashes the following commits:

185348e [Wes McKinney] Fix typo
f29b38c [Wes McKinney] Add notes to indicate scope of usage for RLE, BIT_PACKED encodings

Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/2696f9e0
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/2696f9e0
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/2696f9e0

Branch: refs/heads/master
Commit: 2696f9e0a966bdb98afaca69bf633750a2b02ff2
Parents: c6d306d
Author: Wes McKinney <[email protected]>
Authored: Tue Jan 9 22:04:57 2018 -0500
Committer: Wes McKinney <[email protected]>
Committed: Tue Jan 9 22:04:57 2018 -0500

----------------------------------------------------------------------
 Encodings.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/2696f9e0/Encodings.md
----------------------------------------------------------------------
diff --git a/Encodings.md b/Encodings.md
index 0450588..28429be 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -59,6 +59,7 @@ Data page format: the bit width used to encode the entry ids stored as 1 byte (m
 followed by the values encoded using RLE/Bit packed described above (with the given bit width).

 ### <a name="RLE"></a>Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
+
 This encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values.

 The grammar for this encoding looks like this, given a fixed bit-width known in advance:
@@ -103,7 +104,15 @@ repeated-value := value that is repeated, using a fixed-width of round-up-to-nex
 2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128

+Note that the RLE encoding method is only supported for the following types of
+data:
+
+* Repetition and definition levels
+* Dictionary indices
+* Boolean values in data pages, as an alternative to PLAIN encoding
+
 ### <a name="BITPACKED"></a>Bit-packed (Deprecated) (BIT_PACKED = 4)
+
 This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
 Each value is encoded back to back using a fixed width.
 There is no padding between values (except for the last byte) which is padded with 0s.
@@ -126,6 +135,9 @@ bit value: 00000101 00111001 01110111
 bit label: ABCDEFGH IJKLMNOP QRSTUVWX
 ```

+Note that the BIT_PACKED encoding method is only supported for encoding
+repetition and definition levels.
+
 ### <a name="DELTAENC"></a>Delta Encoding (DELTA_BINARY_PACKED = 5)
 Supported Types: INT32, INT64
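
The two byte layouts touched by this change, ULEB-128 varints and the deprecated BIT_PACKED layout, can be sketched in Python. This is a minimal illustration and not code from parquet-format: the function names `uleb128_encode` and `bit_pack_msb_first` are hypothetical, and the sketch assumes that the `bit value: 00000101 00111001 01110111` example in the diff context encodes the values 0 through 7 at a bit width of 3, packed most significant bit first.

```python
from typing import Iterable


def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as ULEB-128 (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("ULEB-128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def bit_pack_msb_first(values: Iterable[int], bit_width: int) -> bytes:
    """Pack values back to back, most significant bit first.

    Only the final byte is padded, with zero bits, as the BIT_PACKED
    description in the diff states. Hypothetical helper for illustration.
    """
    bits = 0
    nbits = 0
    for v in values:
        if not 0 <= v < (1 << bit_width):
            raise ValueError(f"{v} does not fit in {bit_width} bits")
        bits = (bits << bit_width) | v   # append the value's bits on the right
        nbits += bit_width
    padding = (-nbits) % 8               # zero bits needed to fill the last byte
    bits <<= padding
    return bits.to_bytes((nbits + padding) // 8, "big")


if __name__ == "__main__":
    # Values 0..7 at width 3 give the three bytes shown in the diff context.
    packed = bit_pack_msb_first(range(8), 3)
    print(" ".join(f"{b:08b}" for b in packed))  # 00000101 00111001 01110111
    print(uleb128_encode(300).hex())             # ac02
```

Building the packed stream in a single Python integer keeps the sketch short; a real writer would more likely stream bits into a buffer. The RLE/bit-packing hybrid packs bits in a different order and is not covered here.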
