(parquet-format) branch master updated: PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data (#229)

apitrou Mon, 18 Mar 2024 03:42:13 -0700

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new e517ac4  PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data (#229)
e517ac4 is described below

commit e517ac4dbe08d518eb5c2e58576d4c711973db94
Author: Antoine Pitrou <[email protected]>
AuthorDate: Mon Mar 18 11:41:22 2024 +0100

    PARQUET-2414: Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data (#229)
---
 CHANGES.md                     | 6 ++++++
 Encodings.md                   | 5 +++--
 src/main/thrift/parquet.thrift | 7 +++++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 4002000..7bbce7c 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -19,6 +19,12 @@
 
 # Parquet #
 
+### Version 2.11.0 ###
+
+#### New Feature
+
+*   [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414) - Extend BYTE_STREAM_SPLIT to support INT32, INT64 and FIXED_LEN_BYTE_ARRAY data
+
 ### Version 2.10.0 ###
 
 #### New Feature
diff --git a/Encodings.md b/Encodings.md
index 5040094..ea7e4e3 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -337,14 +337,15 @@ Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the re
 
 ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
 
-Supported Types: FLOAT, DOUBLE
+Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY
 
 This encoding does not reduce the size of the data but can lead to a significantly better
 compression ratio and speed when a compression algorithm is used afterwards.
 
 This encoding creates K byte-streams of length N where K is the size in bytes of the data
-type and N is the number of elements in the data sequence. Specifically, K is 4 for FLOAT
+type and N is the number of elements in the data sequence. For example, K is 4 for FLOAT
 type and 8 for DOUBLE type.
+
 The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
 0-th stream, the 1-st byte goes to the 1-st stream and so on.
 The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 2084ac6..27d4043 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -526,12 +526,15 @@ enum Encoding {
    */
   RLE_DICTIONARY = 8;
 
-  /** Encoding for floating-point data.
+  /** Encoding for fixed-width data (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY).
       K byte-streams are created where K is the size in bytes of the data type.
-      The individual bytes of an FP value are scattered to the corresponding stream and
+      The individual bytes of a value are scattered to the corresponding stream and
       the streams are concatenated.
       This itself does not reduce the size of the data but can lead to better compression
       afterwards.
+
+      Added in 2.8 for FLOAT and DOUBLE.
+      Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11.
    */
   BYTE_STREAM_SPLIT = 9;
 }
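
For readers of this notification, the byte scattering described in the
Encodings.md hunk above can be sketched roughly as follows. This is only an
illustration, not code from the commit or from any Parquet implementation:
the function names are hypothetical, the input is assumed to be the values
already laid out back to back as K-byte fixed-width items (for example
little-endian INT32 values, with K = 4), and for FIXED_LEN_BYTE_ARRAY K is
assumed to be the declared type length.

```python
import struct


def byte_stream_split_encode(data: bytes, k: int) -> bytes:
    """Scatter byte j of every K-byte value into stream j, then concatenate.

    Hypothetical helper: `data` holds N values of `k` bytes each, back to back.
    """
    if k <= 0 or len(data) % k != 0:
        raise ValueError("data length must be a multiple of the value width")
    # data[j::k] picks byte j of value 0, value 1, ... which is the j-th stream.
    return b"".join(data[j::k] for j in range(k))


def byte_stream_split_decode(encoded: bytes, k: int) -> bytes:
    """Inverse of the sketch above: interleave the K streams back into values."""
    if k <= 0 or len(encoded) % k != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    n = len(encoded) // k  # number of values, i.e. the length of each stream
    out = bytearray(len(encoded))
    for j in range(k):
        # Stream j occupies bytes [j*n, (j+1)*n) of the encoded buffer.
        out[j::k] = encoded[j * n:(j + 1) * n]
    return bytes(out)


# Three INT32 values, packed little-endian (K = 4, N = 3).
plain = struct.pack("<3i", 1, 2, 3)
encoded = byte_stream_split_encode(plain, 4)
restored = byte_stream_split_decode(encoded, 4)
```

In this sketch the encoded buffer has the same size as the input, which
matches the note in the diff that the encoding itself does not reduce the
data size; any gain would come from a compressor applied afterwards.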
