This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/parquet-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new b33a583 deploy: fce22def65a1a92e66634b118ad4b6cc3f707ec5
b33a583 is described below
commit b33a5839855ec40d63dc5f4ab8807404aa975b3e
Author: alamb <[email protected]>
AuthorDate: Thu Feb 12 12:38:34 2026 +0000
deploy: fce22def65a1a92e66634b118ad4b6cc3f707ec5
---
.../_print/docs/file-format/data-pages/index.html | 15 ++--
output/_print/docs/file-format/index.html | 15 ++--
output/_print/docs/index.html | 15 ++--
.../file-format/data-pages/encodings/index.html | 43 +++++------
output/docs/file-format/data-pages/index.xml | 85 +++++++++++++++++-----
output/index.xml | 85 +++++++++++++++++-----
6 files changed, 176 insertions(+), 82 deletions(-)
diff --git a/output/_print/docs/file-format/data-pages/index.html b/output/_print/docs/file-format/data-pages/index.html
index 27a0b78..91b8b7f 100644
--- a/output/_print/docs/file-format/data-pages/index.html
+++ b/output/_print/docs/file-format/data-pages/index.html
@@ -47,17 +47,18 @@ when implementing this format, the implementation provided by the
<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supporte [...]
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p>Unless otherwise stated in page or encoding documentation, any
encoding can be
+used with any page type.</p><h3 id=supported-encodings>Supported
Encodings</h3><p>For details on current implementation status, see the <a
href=https://parquet.apache.org/docs/file-format/implementationstatus/#encodings>Implementation
Status</a> page.</p><table><thead><tr><th>Encoding type</th><th>Encoding
enum</th><th>Supported Types</th></tr></thead><tbody><tr><td><a
href=/docs/file-format/data-pages/encodings/#PLAIN>Plain</a></td><td>PLAIN =
0</td><td>All Physical Types</td></tr><tr>< [...]
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
-endian, followed by the bytes.</p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
+endian, followed by the bytes.</p><p><a name=DICTIONARY></a></p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values
are stored as integers
using the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/Bit-Packing
Hybrid</a> encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain
encoding. The dictionary page is
written first, before the data pages of the column chunk.</p><p>Dictionary
page format: the entries in the dictionary using the <a
href=/docs/file-format/data-pages/encodings/#PLAIN>plain</a>
encoding.</p><p>Data page format: the bit width used to encode the entry ids
stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the PLAIN_DICTIONARY enum value is deprecated in
the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.</p><p><a
name=RLE></a></p><h3 id=run-length-encoding--bit-packing-hybrid-rle--3>Run
Length Encoding / Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a
combination of bit-packing and run length encoding to more efficiently store
repeated values.</p><p>The grammar for this encoding looks like this, given a
fixed bit-width known in advance:</p><pre
tabindex=0><code>rle-bit-packed-hybrid: <length> <encoded [...]
+followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the <code>PLAIN_DICTIONARY</code> enum value is
deprecated, use <code>RLE_DICTIONARY</code>
+in a data page and <code>PLAIN</code> in a dictionary page for new Parquet
files.</p><p><a name=RLE></a></p><h3
id=run-length-encoding--bit-packing-hybrid-rle--3>Run Length Encoding /
Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a combination of
bit-packing and run length encoding to more efficiently store repeated
values.</p><p>The grammar for this encoding looks like this, given a fixed
bit-width known in advance:</p><pre tabindex=0><code>rle-bit-packed-hybrid:
<length> [...]
// length is not always prepended, please check the table below for more detail
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
@@ -157,13 +158,13 @@ but in real cases it would be invalid.</p><h4 id=example-1>Example 1</h4><p>1, 2
1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)</p><h4
id=example-2>Example 2</h4><p>7, 5, 3, 1, 2, 3, 4, 5, the deltas would
be</p><p>-2, -2, -2, 1, 1, 1, 1</p><p>The minimum is -2, so the relative deltas
are:</p><p>0, 0, 0, 3, 3, 3, 3</p><p>The encoded data is</p><p>header:
8 (block size), 1 (miniblock count), 8 (value count), 7 (first
value)</p><p>block:
-2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2
bits)</p><h4 id=characteristics>Characteristics</h4><p>This encoding is similar
to the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a>
encoding. However the <a
href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a> encoding
is specifically used when the range of ints is small over the entire page, as
is true of repetition and definition levels. It uses a single bit width for
[...]
-The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encoding is always preferred over PLA [...]
+The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><p><a
name=DELTALENGTH></a></p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encodi [...]
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length
data just
concatenated back to back. The expected savings is from the cost of encoding
the lengths
and possibly better compression in the data (it is no longer interleaved with
the lengths).</p><p>The data stream looks like:</p><pre
tabindex=0><code><Delta Encoded Lengths> <Byte Array Data>
-</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><h3
id=delta-strings-delta_byte_array--7>Delta Strings: (DELTA_BYTE_ARRAY =
7)</h3><p>Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</p><p>This is also
known as incremental [...]
+</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><p><a
name=DELTASTRING></a></p><h3 id=delta-strings-delta_byte_array--7>Delta
Strings: (DELTA_BYTE_ARRAY = 7)</h3><p>Supported Types: BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY</p><p>Thi [...]
sequence of strings, store the prefix length of the previous entry plus the
suffix.</p><p>For a longer description, see <a
href=https://en.wikipedia.org/wiki/Incremental_encoding>https://en.wikipedia.org/wiki/Incremental_encoding</a>.</p><p>This
is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED),
followed by
-the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
+the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
compression ratio and speed when a compression algorithm is used
afterwards.</p><p>This encoding creates K byte-streams of length N where K is
the size in bytes of the data
type and N is the number of elements in the data sequence. For example, K is 4
for FLOAT
type and 8 for DOUBLE type.</p><p>The bytes of each value are scattered to the
corresponding streams. The 0-th byte goes to the
diff --git a/output/_print/docs/file-format/index.html b/output/_print/docs/file-format/index.html
index 531060f..0cd1ae0 100644
--- a/output/_print/docs/file-format/index.html
+++ b/output/_print/docs/file-format/index.html
@@ -1194,17 +1194,18 @@ when implementing this format, the implementation provided by the
<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>8.2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
suppor [...]
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>8.2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p>Unless otherwise stated in page or encoding documentation, any
encoding can be
+used with any page type.</p><h3 id=supported-encodings>Supported
Encodings</h3><p>For details on current implementation status, see the <a
href=https://parquet.apache.org/docs/file-format/implementationstatus/#encodings>Implementation
Status</a> page.</p><table><thead><tr><th>Encoding type</th><th>Encoding
enum</th><th>Supported Types</th></tr></thead><tbody><tr><td><a
href=/docs/file-format/data-pages/encodings/#PLAIN>Plain</a></td><td>PLAIN =
0</td><td>All Physical Types</td></tr><tr>< [...]
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
-endian, followed by the bytes.</p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
+endian, followed by the bytes.</p><p><a name=DICTIONARY></a></p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values
are stored as integers
using the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/Bit-Packing
Hybrid</a> encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain
encoding. The dictionary page is
written first, before the data pages of the column chunk.</p><p>Dictionary
page format: the entries in the dictionary using the <a
href=/docs/file-format/data-pages/encodings/#PLAIN>plain</a>
encoding.</p><p>Data page format: the bit width used to encode the entry ids
stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the PLAIN_DICTIONARY enum value is deprecated in
the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.</p><p><a
name=RLE></a></p><h3 id=run-length-encoding--bit-packing-hybrid-rle--3>Run
Length Encoding / Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a
combination of bit-packing and run length encoding to more efficiently store
repeated values.</p><p>The grammar for this encoding looks like this, given a
fixed bit-width known in advance:</p><pre
tabindex=0><code>rle-bit-packed-hybrid: <length> <encoded [...]
+followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the <code>PLAIN_DICTIONARY</code> enum value is
deprecated, use <code>RLE_DICTIONARY</code>
+in a data page and <code>PLAIN</code> in a dictionary page for new Parquet
files.</p><p><a name=RLE></a></p><h3
id=run-length-encoding--bit-packing-hybrid-rle--3>Run Length Encoding /
Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a combination of
bit-packing and run length encoding to more efficiently store repeated
values.</p><p>The grammar for this encoding looks like this, given a fixed
bit-width known in advance:</p><pre tabindex=0><code>rle-bit-packed-hybrid:
<length> [...]
// length is not always prepended, please check the table below for more detail
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
@@ -1304,13 +1305,13 @@ but in real cases it would be invalid.</p><h4 id=example-1>Example 1</h4><p>1, 2
1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)</p><h4
id=example-2>Example 2</h4><p>7, 5, 3, 1, 2, 3, 4, 5, the deltas would
be</p><p>-2, -2, -2, 1, 1, 1, 1</p><p>The minimum is -2, so the relative deltas
are:</p><p>0, 0, 0, 3, 3, 3, 3</p><p>The encoded data is</p><p>header:
8 (block size), 1 (miniblock count), 8 (value count), 7 (first
value)</p><p>block:
-2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2
bits)</p><h4 id=characteristics>Characteristics</h4><p>This encoding is similar
to the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a>
encoding. However the <a
href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a> encoding
is specifically used when the range of ints is small over the entire page, as
is true of repetition and definition levels. It uses a single bit width for
[...]
-The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encoding is always preferred over PLA [...]
+The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><p><a
name=DELTALENGTH></a></p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encodi [...]
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length
data just
concatenated back to back. The expected savings is from the cost of encoding
the lengths
and possibly better compression in the data (it is no longer interleaved with
the lengths).</p><p>The data stream looks like:</p><pre
tabindex=0><code><Delta Encoded Lengths> <Byte Array Data>
-</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><h3
id=delta-strings-delta_byte_array--7>Delta Strings: (DELTA_BYTE_ARRAY =
7)</h3><p>Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</p><p>This is also
known as incremental [...]
+</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><p><a
name=DELTASTRING></a></p><h3 id=delta-strings-delta_byte_array--7>Delta
Strings: (DELTA_BYTE_ARRAY = 7)</h3><p>Supported Types: BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY</p><p>Thi [...]
sequence of strings, store the prefix length of the previous entry plus the
suffix.</p><p>For a longer description, see <a
href=https://en.wikipedia.org/wiki/Incremental_encoding>https://en.wikipedia.org/wiki/Incremental_encoding</a>.</p><p>This
is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED),
followed by
-the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
+the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
compression ratio and speed when a compression algorithm is used
afterwards.</p><p>This encoding creates K byte-streams of length N where K is
the size in bytes of the data
type and N is the number of elements in the data sequence. For example, K is 4
for FLOAT
type and 8 for DOUBLE type.</p><p>The bytes of each value are scattered to the
corresponding streams. The 0-th byte goes to the
diff --git a/output/_print/docs/index.html b/output/_print/docs/index.html
index 376a875..2d0d190 100644
--- a/output/_print/docs/index.html
+++ b/output/_print/docs/index.html
@@ -1210,17 +1210,18 @@ when implementing this format, the implementation provided by the
<a href=https://facebook.github.io/zstd/>Zstandard compression library</a>
is authoritative.</p><h3 id=lz4_raw>LZ4_RAW</h3><p>A codec based on the <a
href=https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md>LZ4 block
format</a>.
If any ambiguity arises when implementing this format, the implementation
-provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>3.8.2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p><a name=PLAIN></a></p><h3 id=plain-plain--0>Plain: (PLAIN =
0)</h3><p>Supported Types: all</p><p>This is the plain encoding that must be
supp [...]
+provided by the <a href=https://www.lz4.org/>LZ4 compression library</a> is
authoritative.</p></div><div class=td-content
style=page-break-before:always><h1 id=pg-9aa971e751fdd370158d525ad337ef7a>3.8.2
-</h1><h1 id=parquet-encoding-definitions>Parquet encoding
definitions</h1><p>This file contains the specification of all supported
encodings.</p><p>Unless otherwise stated in page or encoding documentation, any
encoding can be
+used with any page type.</p><h3 id=supported-encodings>Supported
Encodings</h3><p>For details on current implementation status, see the <a
href=https://parquet.apache.org/docs/file-format/implementationstatus/#encodings>Implementation
Status</a> page.</p><table><thead><tr><th>Encoding type</th><th>Encoding
enum</th><th>Supported Types</th></tr></thead><tbody><tr><td><a
href=/docs/file-format/data-pages/encodings/#PLAIN>Plain</a></td><td>PLAIN =
0</td><td>All Physical Types</td></tr><tr>< [...]
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
-endian, followed by the bytes.</p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
+endian, followed by the bytes.</p><p><a name=DICTIONARY></a></p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values
are stored as integers
using the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/Bit-Packing
Hybrid</a> encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain
encoding. The dictionary page is
written first, before the data pages of the column chunk.</p><p>Dictionary
page format: the entries in the dictionary using the <a
href=/docs/file-format/data-pages/encodings/#PLAIN>plain</a>
encoding.</p><p>Data page format: the bit width used to encode the entry ids
stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the PLAIN_DICTIONARY enum value is deprecated in
the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.</p><p><a
name=RLE></a></p><h3 id=run-length-encoding--bit-packing-hybrid-rle--3>Run
Length Encoding / Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a
combination of bit-packing and run length encoding to more efficiently store
repeated values.</p><p>The grammar for this encoding looks like this, given a
fixed bit-width known in advance:</p><pre
tabindex=0><code>rle-bit-packed-hybrid: <length> <encoded [...]
+followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the <code>PLAIN_DICTIONARY</code> enum value is
deprecated, use <code>RLE_DICTIONARY</code>
+in a data page and <code>PLAIN</code> in a dictionary page for new Parquet
files.</p><p><a name=RLE></a></p><h3
id=run-length-encoding--bit-packing-hybrid-rle--3>Run Length Encoding /
Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a combination of
bit-packing and run length encoding to more efficiently store repeated
values.</p><p>The grammar for this encoding looks like this, given a fixed
bit-width known in advance:</p><pre tabindex=0><code>rle-bit-packed-hybrid:
<length> [...]
// length is not always prepended, please check the table below for more detail
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
@@ -1320,13 +1321,13 @@ but in real cases it would be invalid.</p><h4 id=example-1>Example 1</h4><p>1, 2
1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)</p><h4
id=example-2>Example 2</h4><p>7, 5, 3, 1, 2, 3, 4, 5, the deltas would
be</p><p>-2, -2, -2, 1, 1, 1, 1</p><p>The minimum is -2, so the relative deltas
are:</p><p>0, 0, 0, 3, 3, 3, 3</p><p>The encoded data is</p><p>header:
8 (block size), 1 (miniblock count), 8 (value count), 7 (first
value)</p><p>block:
-2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2
bits)</p><h4 id=characteristics>Characteristics</h4><p>This encoding is similar
to the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a>
encoding. However the <a
href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a> encoding
is specifically used when the range of ints is small over the entire page, as
is true of repetition and definition levels. It uses a single bit width for
[...]
-The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encoding is always preferred over PLA [...]
+The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><p><a
name=DELTALENGTH></a></p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encodi [...]
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length
data just
concatenated back to back. The expected savings is from the cost of encoding
the lengths
and possibly better compression in the data (it is no longer interleaved with
the lengths).</p><p>The data stream looks like:</p><pre
tabindex=0><code><Delta Encoded Lengths> <Byte Array Data>
-</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><h3
id=delta-strings-delta_byte_array--7>Delta Strings: (DELTA_BYTE_ARRAY =
7)</h3><p>Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</p><p>This is also
known as incremental [...]
+</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><p><a
name=DELTASTRING></a></p><h3 id=delta-strings-delta_byte_array--7>Delta
Strings: (DELTA_BYTE_ARRAY = 7)</h3><p>Supported Types: BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY</p><p>Thi [...]
sequence of strings, store the prefix length of the previous entry plus the
suffix.</p><p>For a longer description, see <a
href=https://en.wikipedia.org/wiki/Incremental_encoding>https://en.wikipedia.org/wiki/Incremental_encoding</a>.</p><p>This
is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED),
followed by
-the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
+the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
compression ratio and speed when a compression algorithm is used
afterwards.</p><p>This encoding creates K byte-streams of length N where K is
the size in bytes of the data
type and N is the number of elements in the data sequence. For example, K is 4
for FLOAT
type and 8 for DOUBLE type.</p><p>The bytes of each value are scattered to the
corresponding streams. The 0-th byte goes to the
diff --git a/output/docs/file-format/data-pages/encodings/index.html
b/output/docs/file-format/data-pages/encodings/index.html
index 2ec9007..9a61291 100644
--- a/output/docs/file-format/data-pages/encodings/index.html
+++ b/output/docs/file-format/data-pages/encodings/index.html
@@ -1,34 +1,31 @@
<!doctype html><html itemscope itemtype=http://schema.org/WebPage lang=en
class=no-js><head><meta charset=utf-8><meta name=viewport
content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots
content="index, follow"><link rel="shortcut icon"
href=/favicons/favicon.ico><link rel=apple-touch-icon
href=/favicons/apple-touch-icon-180x180.png sizes=180x180><link rel=icon
type=image/png href=/favicons/favicon-16x16.png sizes=16x16><link rel=icon
type=image/png href=/favicon [...]
-Plain: (PLAIN = 0) Supported Types: all
-This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
-The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian. Floating point types are encoded in
IEEE."><meta property="og:url" content="/docs/file-fo [...]
-Plain: (PLAIN = 0) Supported Types: all
-This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
-The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian. Floating point types are encoded in
IEEE."><meta property="og:locale" content="en"><meta [...]
-Plain: (PLAIN = 0) Supported Types: all
-This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
-The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian. Floating point types are encoded in
IEEE."><meta itemprop=dateModified content="2025-12-0 [...]
-Plain: (PLAIN = 0) Supported Types: all
-This is the plain encoding that must be supported for types. It is intended to
be the simplest encoding. Values are encoded back to back.
-The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:
-BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes
little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE
little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes
little endian followed by the bytes contained in the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this
outputs the data as little endian. Floating point types are encoded in
IEEE."><link rel=preload href=/scss/main.min.cd2b2ff1 [...]
+Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.
+Supported Encodings For details on current implementation status, see the
Implementation Status page.
+Encoding type Encoding enum Supported Types Plain PLAIN = 0 All Physical Types
Dictionary Encoding PLAIN_DICTIONARY = 2 (Deprecated) RLE_DICTIONARY = 8 All
Physical Types Run Length Encoding / Bit-Packing Hybrid RLE = 3 BOOLEAN,
Dictionary Indices Delta Encoding DELTA_BINARY_PACKED = 5 INT32, INT64
Delta-length byte array DELTA_LENGTH_BYTE_ARRAY = 6 BYTE_ARRAY Delta Strings
DELTA_BYTE_ARRAY = 7 BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY Byte Stream Split
BYTE_STREAM_SPLIT = 9 INT32, INT64, FLOAT, [...]
+Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.
+Supported Encodings For details on current implementation status, see the
Implementation Status page.
+Encoding type Encoding enum Supported Types Plain PLAIN = 0 All Physical Types
Dictionary Encoding PLAIN_DICTIONARY = 2 (Deprecated) RLE_DICTIONARY = 8 All
Physical Types Run Length Encoding / Bit-Packing Hybrid RLE = 3 BOOLEAN,
Dictionary Indices Delta Encoding DELTA_BINARY_PACKED = 5 INT32, INT64
Delta-length byte array DELTA_LENGTH_BYTE_ARRAY = 6 BYTE_ARRAY Delta Strings
DELTA_BYTE_ARRAY = 7 BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY Byte Stream Split
BYTE_STREAM_SPLIT = 9 INT32, INT64, FLOAT, [...]
+Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.
+Supported Encodings For details on current implementation status, see the
Implementation Status page.
+Encoding type Encoding enum Supported Types Plain PLAIN = 0 All Physical Types
Dictionary Encoding PLAIN_DICTIONARY = 2 (Deprecated) RLE_DICTIONARY = 8 All
Physical Types Run Length Encoding / Bit-Packing Hybrid RLE = 3 BOOLEAN,
Dictionary Indices Delta Encoding DELTA_BINARY_PACKED = 5 INT32, INT64
Delta-length byte array DELTA_LENGTH_BYTE_ARRAY = 6 BYTE_ARRAY Delta Strings
DELTA_BYTE_ARRAY = 7 BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY Byte Stream Split
BYTE_STREAM_SPLIT = 9 INT32, INT64, FLOAT, [...]
+Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.
+Supported Encodings For details on current implementation status, see the
Implementation Status page.
+Encoding type Encoding enum Supported Types Plain PLAIN = 0 All Physical Types
Dictionary Encoding PLAIN_DICTIONARY = 2 (Deprecated) RLE_DICTIONARY = 8 All
Physical Types Run Length Encoding / Bit-Packing Hybrid RLE = 3 BOOLEAN,
Dictionary Indices Delta Encoding DELTA_BINARY_PACKED = 5 INT32, INT64
Delta-length byte array DELTA_LENGTH_BYTE_ARRAY = 6 BYTE_ARRAY Delta Strings
DELTA_BYTE_ARRAY = 7 BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY Byte Stream Split
BYTE_STREAM_SPLIT = 9 INT32, INT64, FLOAT, [...]
<a
href=https://github.com/apache/parquet-site/edit/production/content/en/docs/File%20Format/Data%20Pages/encodings.md
class="td-page-meta--edit td-page-meta__edit" target=_blank rel=noopener><i
class="fa-solid fa-pen-to-square fa-fw"></i> Edit this page</a>
<a
href="https://github.com/apache/parquet-site/new/production/content/en/docs/File%20Format/Data%20Pages?filename=change-me.md&value=---%0Atitle%3A+%22Long+Page+Title%22%0AlinkTitle%3A+%22Short+Nav+Title%22%0Aweight%3A+100%0Adescription%3A+%3E-%0A+++++Page+description+for+heading+and+indexes.%0A---%0A%0A%23%23+Heading%0A%0AEdit+this+template+to+create+your+new+page.%0A%0A%2A+Give+it+a+good+name%2C+ending+in+%60.md%60+-+e.g.+%60getting-started.md%60%0A%2A+Edit+the+%22front+matter%22+
[...]
<a href="https://github.com/apache/parquet-site/issues/new?title="
class="td-page-meta--issue td-page-meta__issue" target=_blank rel=noopener><i
class="fa-solid fa-list-check fa-fw"></i> Create documentation issue</a>
-<a id=print href=/_print/docs/file-format/data-pages/><i class="fa-solid
fa-print fa-fw"></i> Print entire section</a></div><div class=td-toc
data-proofer-ignore></div></aside><main class="col-12 col-md-9 col-xl-8
ps-md-5" role=main><nav aria-label=breadcrumb class=td-breadcrumbs><ol
class=breadcrumb><li class=breadcrumb-item><a
href=/docs/>Documentation</a></li><li class=breadcrumb-item><a
href=/docs/file-format/>File Format</a></li><li class=breadcrumb-item><a
href=/docs/file-format/da [...]
+<a id=print href=/_print/docs/file-format/data-pages/><i class="fa-solid
fa-print fa-fw"></i> Print entire section</a></div><div class=td-toc
data-proofer-ignore></div></aside><main class="col-12 col-md-9 col-xl-8
ps-md-5" role=main><nav aria-label=breadcrumb class=td-breadcrumbs><ol
class=breadcrumb><li class=breadcrumb-item><a
href=/docs/>Documentation</a></li><li class=breadcrumb-item><a
href=/docs/file-format/>File Format</a></li><li class=breadcrumb-item><a
href=/docs/file-format/da [...]
+used with any page type.</p><h3 id=supported-encodings>Supported
Encodings</h3><p>For details on current implementation status, see the <a
href=https://parquet.apache.org/docs/file-format/implementationstatus/#encodings>Implementation
Status</a> page.</p><table><thead><tr><th>Encoding type</th><th>Encoding
enum</th><th>Supported Types</th></tr></thead><tbody><tr><td><a
href=/docs/file-format/data-pages/encodings/#PLAIN>Plain</a></td><td>PLAIN =
0</td><td>All Physical Types</td></tr><tr>< [...]
intended to be the simplest encoding. Values are encoded back to
back.</p><p>The plain encoding is used whenever a more efficient encoding can
not be used. It
stores the data in the following format:</p><ul><li>BOOLEAN: <a
href=/docs/file-format/data-pages/encodings/#BITPACKED>Bit Packed</a>, LSB
first</li><li>INT32: 4 bytes little endian</li><li>INT64: 8 bytes little
endian</li><li>INT96: 12 bytes little endian (deprecated)</li><li>FLOAT: 4
bytes IEEE little endian</li><li>DOUBLE: 8 bytes IEEE little
endian</li><li>BYTE_ARRAY: length in 4 bytes little endian followed by the
bytes contained in the array</li><li>FIXED_LEN_BYTE_ARRAY: the bytes [...]
point types are encoded in IEEE.</p><p>For the byte array type, it encodes the
length as a 4 byte little
-endian, followed by the bytes.</p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
+endian, followed by the bytes.</p><p><a name=DICTIONARY></a></p><h3
id=dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8>Dictionary
Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)</h3><p>The dictionary
encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values
are stored as integers
using the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/Bit-Packing
Hybrid</a> encoding. If the dictionary grows too big, whether in size
or number of distinct values, the encoding will fall back to the plain
encoding. The dictionary page is
written first, before the data pages of the column chunk.</p><p>Dictionary
page format: the entries in the dictionary using the <a
href=/docs/file-format/data-pages/encodings/#PLAIN>plain</a>
encoding.</p><p>Data page format: the bit width used to encode the entry ids
stored as 1 byte (max bit width = 32),
-followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the PLAIN_DICTIONARY enum value is deprecated in
the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.</p><p><a
name=RLE></a></p><h3 id=run-length-encoding--bit-packing-hybrid-rle--3>Run
Length Encoding / Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a
combination of bit-packing and run length encoding to more efficiently store
repeated values.</p><p>The grammar for this encoding looks like this, given a
fixed bit-width known in advance:</p><pre
tabindex=0><code>rle-bit-packed-hybrid: <length> <encoded [...]
+followed by the values encoded using RLE/Bit packed described above (with the
given bit width).</p><p>Using the <code>PLAIN_DICTIONARY</code> enum value is
deprecated, use <code>RLE_DICTIONARY</code>
+in a data page and <code>PLAIN</code> in a dictionary page for new Parquet
files.</p><p><a name=RLE></a></p><h3
id=run-length-encoding--bit-packing-hybrid-rle--3>Run Length Encoding /
Bit-Packing Hybrid (RLE = 3)</h3><p>This encoding uses a combination of
bit-packing and run length encoding to more efficiently store repeated
values.</p><p>The grammar for this encoding looks like this, given a fixed
bit-width known in advance:</p><pre tabindex=0><code>rle-bit-packed-hybrid:
<length> [...]
// length is not always prepended, please check the table below for more detail
length := length of the <encoded-data> in bytes stored as 4 bytes little
endian (unsigned int32)
encoded-data := <run>*
@@ -128,13 +125,13 @@ but in real cases it would be invalid.</p><h4
id=example-1>Example 1</h4><p>1, 2
1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)</p><h4
id=example-2>Example 2</h4><p>7, 5, 3, 1, 2, 3, 4, 5, the deltas would
be</p><p>-2, -2, -2, 1, 1, 1, 1</p><p>The minimum is -2, so the relative deltas
are:</p><p>0, 0, 0, 3, 3, 3, 3</p><p>The encoded data is</p><p>header:
8 (block size), 1 (miniblock count), 8 (value count), 7 (first
value)</p><p>block:
-2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2
bits)</p><h4 id=characteristics>Characteristics</h4><p>This encoding is similar
to the <a href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a>
encoding. However the <a
href=/docs/file-format/data-pages/encodings/#RLE>RLE/bit-packing</a> encoding
is specifically used when the range of ints is small over the entire page, as
is true of repetition and definition levels. It uses a single bit width for
[...]
-The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encoding is always preferred over PLA [...]
+The delta encoding algorithm described above stores a bit width per miniblock
and is less sensitive to variations in the size of encoded integers. It is also
somewhat doing RLE encoding as a block containing all the same values will be
bit packed to a zero bit width thus being only a header.</p><p><a
name=DELTALENGTH></a></p><h3
id=delta-length-byte-array-delta_length_byte_array--6>Delta-length byte array:
(DELTA_LENGTH_BYTE_ARRAY = 6)</h3><p>Supported Types: BYTE_ARRAY</p><p>This
encodi [...]
encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length
data just
concatenated back to back. The expected savings is from the cost of encoding
the lengths
and possibly better compression in the data (it is no longer interleaved with
the lengths).</p><p>The data stream looks like:</p><pre
tabindex=0><code><Delta Encoded Lengths> <Byte Array Data>
-</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><h3
id=delta-strings-delta_byte_array--7>Delta Strings: (DELTA_BYTE_ARRAY =
7)</h3><p>Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</p><p>This is also
known as incremental [...]
+</code></pre><p>For example, if the data was “Hello”,
“World”, “Foobar”, “ABCDEF”</p><p>then the
encoded data would be comprised of the following
segments:</p><ul><li>DeltaEncoding(5, 5, 6, 6) (the string
lengths)</li><li>“HelloWorldFoobarABCDEF”</li></ul><p><a
name=DELTASTRING></a></p><h3 id=delta-strings-delta_byte_array--7>Delta
Strings: (DELTA_BYTE_ARRAY = 7)</h3><p>Supported Types: BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY</p><p>Thi [...]
sequence of strings, store the prefix length of the previous entry plus the
suffix.</p><p>For a longer description, see <a
href=https://en.wikipedia.org/wiki/Incremental_encoding>https://en.wikipedia.org/wiki/Incremental_encoding</a>.</p><p>This
is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED),
followed by
-the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
+the suffixes encoded as delta length byte arrays
(DELTA_LENGTH_BYTE_ARRAY).</p><p>For example, if the data was
“axis”, “axle”, “babble”,
“babyhood”</p><p>then the encoded data would be comprised of the
following segments:</p><ul><li>DeltaEncoding(0, 2, 0, 3) (the prefix
lengths)</li><li>DeltaEncoding(4, 2, 6, 5) (the suffix
lengths)</li><li>“axislebabbleyhood”</li></ul><p>Note that, even
for FIXED_LEN_BYTE_ARRAY, all lengths are [...]
compression ratio and speed when a compression algorithm is used
afterwards.</p><p>This encoding creates K byte-streams of length N where K is
the size in bytes of the data
type and N is the number of elements in the data sequence. For example, K is 4
for FLOAT
type and 8 for DOUBLE type.</p><p>The bytes of each value are scattered to the
corresponding streams. The 0-th byte goes to the
diff --git a/output/docs/file-format/data-pages/index.xml
b/output/docs/file-format/data-pages/index.xml
index 0ffe855..52938e7 100644
--- a/output/docs/file-format/data-pages/index.xml
+++ b/output/docs/file-format/data-pages/index.xml
@@ -13,25 +13,72 @@ library, without any additional framing or padding. The
information required
for precise allocation of compressed and decompressed buffers is written
in the <code>PageHeader</code>
struct.</p></description></item><item><title/><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description><h1
id="parquet-encoding-definitions">Parquet encoding definitions</h1>
<p>This file contains the specification of all supported
encodings.</p>
-<p><a name="PLAIN"></a></p>
-<h3 id="plain-plain--0">Plain: (PLAIN = 0)</h3>
-<p>Supported Types: all</p>
-<p>This is the plain encoding that must be supported for types. It is
-intended to be the simplest encoding. Values are encoded back to
back.</p>
-<p>The plain encoding is used whenever a more efficient encoding can not
be used. It
-stores the data in the following format:</p>
-<ul>
-<li>BOOLEAN: <a
href="/docs/file-format/data-pages/encodings/#BITPACKED">Bit
Packed</a>, LSB first</li>
-<li>INT32: 4 bytes little endian</li>
-<li>INT64: 8 bytes little endian</li>
-<li>INT96: 12 bytes little endian (deprecated)</li>
-<li>FLOAT: 4 bytes IEEE little endian</li>
-<li>DOUBLE: 8 bytes IEEE little endian</li>
-<li>BYTE_ARRAY: length in 4 bytes little endian followed by the bytes
contained in the array</li>
-<li>FIXED_LEN_BYTE_ARRAY: the bytes contained in the array</li>
-</ul>
-<p>For native types, this outputs the data as little endian. Floating
-point types are encoded in
IEEE.</p></description></item><item><title/><link>/docs/file-format/data-pages/encryption/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encryption/</guid><description><h1
id="parquet-modular-encryption">Parquet Modular Encryption</h1>
+<p>Unless otherwise stated in page or encoding documentation, any
encoding can be
+used with any page type.</p>
+<h3 id="supported-encodings">Supported Encodings</h3>
+<p>For details on current implementation status, see the <a
href="https://parquet.apache.org/docs/file-format/implementationstatus/#encodings">Implementation
Status</a> page.</p>
+<table>
+ <thead>
+ <tr>
+ <th>Encoding type</th>
+ <th>Encoding enum</th>
+ <th>Supported Types</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#PLAIN">Plain</a></td>
+ <td>PLAIN = 0</td>
+ <td>All Physical Types</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DICTIONARY">Dictionary
Encoding</a></td>
+ <td>PLAIN_DICTIONARY = 2 (Deprecated) <br> RLE_DICTIONARY =
8</td>
+ <td>All Physical Types</td>
+ </tr>
+ <tr>
+ <td><a href="/docs/file-format/data-pages/encodings/#RLE">Run
Length Encoding / Bit-Packing Hybrid</a></td>
+ <td>RLE = 3</td>
+ <td>BOOLEAN, Dictionary Indices</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTAENC">Delta
Encoding</a></td>
+ <td>DELTA_BINARY_PACKED = 5</td>
+ <td>INT32, INT64</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTALENGTH">Delta-length byte
array</a></td>
+ <td>DELTA_LENGTH_BYTE_ARRAY = 6</td>
+ <td>BYTE_ARRAY</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTASTRING">Delta
Strings</a></td>
+ <td>DELTA_BYTE_ARRAY = 7</td>
+ <td>BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#BYTESTREAMSPLIT">Byte Stream
Split</a></td>
+ <td>BYTE_STREAM_SPLIT = 9</td>
+ <td>INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY</td>
+ </tr>
+ </tbody>
+</table>
+<h3 id="deprecated-encodings">Deprecated Encodings</h3>
+<table>
+ <thead>
+ <tr>
+ <th>Encoding type</th>
+ <th>Encoding enum</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#BITPACKED">Bit-packed
(Deprecated)</a></td>
+ <td>BIT_PACKED = 4</td>
+ </tr>
+ </tbody>
+</table>
+<p><a
name="PLAIN"></a></p></description></item><item><title/><link>/docs/file-format/data-pages/encryption/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encryption/</guid><description><h1
id="parquet-modular-encryption">Parquet Modular Encryption</h1>
<p>Parquet files containing sensitive information can be protected by
the modular encryption
mechanism that encrypts and authenticates the file data and metadata - while
allowing
for a regular Parquet functionality (columnar projection, predicate pushdown,
encoding
diff --git a/output/index.xml b/output/index.xml
index e01cdd1..9352d89 100644
--- a/output/index.xml
+++ b/output/index.xml
@@ -21,25 +21,72 @@ library, without any additional framing or padding. The
information required
for precise allocation of compressed and decompressed buffers is written
in the <code>PageHeader</code>
struct.</p></description></item><item><title/><link>/docs/file-format/data-pages/encodings/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encodings/</guid><description><h1
id="parquet-encoding-definitions">Parquet encoding definitions</h1>
<p>This file contains the specification of all supported
encodings.</p>
-<p><a name="PLAIN"></a></p>
-<h3 id="plain-plain--0">Plain: (PLAIN = 0)</h3>
-<p>Supported Types: all</p>
-<p>This is the plain encoding that must be supported for types. It is
-intended to be the simplest encoding. Values are encoded back to
back.</p>
-<p>The plain encoding is used whenever a more efficient encoding can not
be used. It
-stores the data in the following format:</p>
-<ul>
-<li>BOOLEAN: <a
href="/docs/file-format/data-pages/encodings/#BITPACKED">Bit
Packed</a>, LSB first</li>
-<li>INT32: 4 bytes little endian</li>
-<li>INT64: 8 bytes little endian</li>
-<li>INT96: 12 bytes little endian (deprecated)</li>
-<li>FLOAT: 4 bytes IEEE little endian</li>
-<li>DOUBLE: 8 bytes IEEE little endian</li>
-<li>BYTE_ARRAY: length in 4 bytes little endian followed by the bytes
contained in the array</li>
-<li>FIXED_LEN_BYTE_ARRAY: the bytes contained in the array</li>
-</ul>
-<p>For native types, this outputs the data as little endian. Floating
-point types are encoded in
IEEE.</p></description></item><item><title/><link>/docs/file-format/data-pages/encryption/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encryption/</guid><description><h1
id="parquet-modular-encryption">Parquet Modular Encryption</h1>
+<p>Unless otherwise stated in page or encoding documentation, any
encoding can be
+used with any page type.</p>
+<h3 id="supported-encodings">Supported Encodings</h3>
+<p>For details on current implementation status, see the <a
href="https://parquet.apache.org/docs/file-format/implementationstatus/#encodings">Implementation
Status</a> page.</p>
+<table>
+ <thead>
+ <tr>
+ <th>Encoding type</th>
+ <th>Encoding enum</th>
+ <th>Supported Types</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#PLAIN">Plain</a></td>
+ <td>PLAIN = 0</td>
+ <td>All Physical Types</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DICTIONARY">Dictionary
Encoding</a></td>
+ <td>PLAIN_DICTIONARY = 2 (Deprecated) <br> RLE_DICTIONARY =
8</td>
+ <td>All Physical Types</td>
+ </tr>
+ <tr>
+ <td><a href="/docs/file-format/data-pages/encodings/#RLE">Run
Length Encoding / Bit-Packing Hybrid</a></td>
+ <td>RLE = 3</td>
+ <td>BOOLEAN, Dictionary Indices</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTAENC">Delta
Encoding</a></td>
+ <td>DELTA_BINARY_PACKED = 5</td>
+ <td>INT32, INT64</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTALENGTH">Delta-length byte
array</a></td>
+ <td>DELTA_LENGTH_BYTE_ARRAY = 6</td>
+ <td>BYTE_ARRAY</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#DELTASTRING">Delta
Strings</a></td>
+ <td>DELTA_BYTE_ARRAY = 7</td>
+ <td>BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY</td>
+ </tr>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#BYTESTREAMSPLIT">Byte Stream
Split</a></td>
+ <td>BYTE_STREAM_SPLIT = 9</td>
+ <td>INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY</td>
+ </tr>
+ </tbody>
+</table>
+<h3 id="deprecated-encodings">Deprecated Encodings</h3>
+<table>
+ <thead>
+ <tr>
+ <th>Encoding type</th>
+ <th>Encoding enum</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><a
href="/docs/file-format/data-pages/encodings/#BITPACKED">Bit-packed
(Deprecated)</a></td>
+ <td>BIT_PACKED = 4</td>
+ </tr>
+ </tbody>
+</table>
+<p><a
name="PLAIN"></a></p></description></item><item><title/><link>/docs/file-format/data-pages/encryption/</link><pubDate>Mon,
01 Jan 0001 00:00:00
+0000</pubDate><guid>/docs/file-format/data-pages/encryption/</guid><description><h1
id="parquet-modular-encryption">Parquet Modular Encryption</h1>
<p>Parquet files containing sensitive information can be protected by
the modular encryption
mechanism that encrypts and authenticates the file data and metadata - while
allowing
for a regular Parquet functionality (columnar projection, predicate pushdown,
encoding