[parquet-format] branch master updated: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY (#189)

apitrou Thu, 19 Jan 2023 14:20:55 -0800

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new bfc549b  PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY (#189)
bfc549b is described below

commit bfc549b93e6927cb1fc425466e4084f76edc6d22
Author: Antoine Pitrou <[email protected]>
AuthorDate: Thu Jan 19 23:20:21 2023 +0100

    PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY (#189)
    
    DELTA_BYTE_ARRAY has been supported for FIXED_LEN_BYTE_ARRAY by parquet-mr since 2015 (see PARQUET-152).
    Update the spec in consequence.
    
    Also improve wording, markup and add an example.
---
 Encodings.md | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/Encodings.md b/Encodings.md
index 40e2177..a84cb02 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -280,16 +280,19 @@ concatenated back to back. The expected savings is from the cost of encoding the
 and possibly better compression in the data (it is no longer interleaved with the lengths).
 
 The data stream looks like:
-
+```
 <Delta Encoded Lengths> <Byte Array Data>
+```
 
-For example, if the data was "Hello", "World", "Foobar", "ABCDEF":
+For example, if the data was "Hello", "World", "Foobar", "ABCDEF"
 
-The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"
+then the encoded data would be comprised of the following segments:
+- DeltaEncoding(5, 5, 6, 6) (the string lengths)
+- "HelloWorldFoobarABCDEF"
 
 ### Delta Strings: (DELTA_BYTE_ARRAY = 7)
 
-Supported Types: BYTE_ARRAY
+Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
 
 This is also known as incremental encoding or front compression: for each element in a
 sequence of strings, store the prefix length of the previous entry plus the suffix.
@@ -299,9 +302,18 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
 the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
 
+For example, if the data was "axis", "axle", "babble", "babyhood"
+
+then the encoded data would be comprised of the following segments:
+- DeltaEncoding(0, 2, 0, 3) (the prefix lengths)
+- DeltaEncoding(4, 2, 6, 5) (the suffix lengths)
+- "axislebabbleyhood"
+
+Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the redundancy.
+
 ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
 
-Supported Types: FLOAT DOUBLE
+Supported Types: FLOAT, DOUBLE
 
 This encoding does not reduce the size of the data but can lead to a significantly better
 compression ratio and speed when a compression algorithm is used afterwards.
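
The "Delta Strings" example added in the diff above can be reproduced with a short sketch. This is an illustration only, not code from parquet-format or parquet-mr: the function names (`common_prefix_len`, `front_compress`, `front_decompress`) are hypothetical, the sketch assumes the writer always picks the longest common prefix with the previous value, and it stops at the three logical segments without applying the DELTA_BINARY_PACKED bit-packing to the length lists.

```python
from typing import List, Tuple


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_compress(values: List[bytes]) -> Tuple[List[int], List[int], bytes]:
    """Split values into prefix lengths, suffix lengths and concatenated suffixes.

    Assumption: the longest common prefix with the previous value is used.
    The two integer lists would still need delta encoding in a real writer.
    """
    prefix_lengths: List[int] = []
    suffix_lengths: List[int] = []
    suffixes = bytearray()
    previous = b""
    for value in values:
        p = common_prefix_len(previous, value)
        prefix_lengths.append(p)
        suffix_lengths.append(len(value) - p)
        suffixes += value[p:]
        previous = value
    return prefix_lengths, suffix_lengths, bytes(suffixes)


def front_decompress(
    prefix_lengths: List[int], suffix_lengths: List[int], suffixes: bytes
) -> List[bytes]:
    """Rebuild the original values from the three segments."""
    values: List[bytes] = []
    previous = b""
    offset = 0
    for p, s in zip(prefix_lengths, suffix_lengths):
        value = previous[:p] + suffixes[offset:offset + s]
        values.append(value)
        offset += s
        previous = value
    return values


data = [b"axis", b"axle", b"babble", b"babyhood"]
prefixes, suffix_lens, blob = front_compress(data)
print(prefixes, suffix_lens, blob)
```

For the four strings from the diff this yields prefix lengths `[0, 2, 0, 3]`, suffix lengths `[4, 2, 6, 5]` and the bytes `axislebabbleyhood`. Feeding fixed-width values such as `b"aaaa"` and `b"aabb"` through the same function gives suffix lengths `[4, 2]`, which shows why the lengths remain in the stream even for FIXED_LEN_BYTE_ARRAY.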
