This is an automated email from the ASF dual-hosted git repository.

gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git

The following commit(s) were added to refs/heads/master by this push:
     new ee02ef8  PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)
ee02ef8 is described below

commit ee02ef8c8f33bd3d5ed0582ded7e20439e12d933
Author: martinradev <martin.b.ra...@gmail.com>
AuthorDate: Tue Dec 3 08:34:53 2019 +0000

    PARQUET-1622: Add BYTE_STREAM_SPLIT encoding (#144)

    The patch extends the format to add the BYTE_STREAM_SPLIT encoding
    and adds documentation for it.
---
 Encodings.md                   | 24 ++++++++++++++++++++++++
 src/main/thrift/parquet.thrift |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/Encodings.md b/Encodings.md
index 236d8b2..4f56104 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -261,3 +261,27 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding

 This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
 the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
+
+### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
+
+Supported Types: FLOAT DOUBLE
+
+This encoding does not reduce the size of the data but can lead to a significantly better
+compression ratio and speed when a compression algorithm is used afterwards.
+
+This encoding creates K byte-streams of length N where K is the size in bytes of the data
+type and N is the number of elements in the data sequence.
+The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
+0-th stream, the 1-st byte goes to the 1-st stream and so on.
+The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
+
+Example:
+Original data is three 32-bit floats and for simplicity we look at their raw representation.
+```
+       Element 0      Element 1      Element 2
+Bytes  AA BB CC DD    00 11 22 33    A3 B4 C5 D6
+```
+After applying the transformation, the data has the following representation:
+```
+Bytes  AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6
+```
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 68820ca..0c1a8ea 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -457,6 +457,15 @@ enum Encoding {
   /** Dictionary encoding: the ids are encoded using the RLE encoding
    */
   RLE_DICTIONARY = 8;
+
+  /** Encoding for floating-point data.
+      K byte-streams are created where K is the size in bytes of the data type.
+      The individual bytes of an FP value are scattered to the corresponding stream and
+      the streams are concatenated.
+      This itself does not reduce the size of the data but can lead to better compression
+      afterwards.
+   */
+  BYTE_STREAM_SPLIT = 9;
 }

 /**
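
For readers of this notification, the transformation documented in the Encodings.md hunk above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `byte_stream_split_encode` and `byte_stream_split_decode` and the `width` parameter (K, the size in bytes of the data type) are hypothetical and are not part of parquet-format or of any Parquet implementation. The input is treated as the raw bytes of N values, as in the documented example.

```python
def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte k of every value into stream k, then concatenate the streams.

    `width` is K, the size in bytes of the data type (4 for FLOAT, 8 for DOUBLE).
    Hypothetical helper for illustration only.
    """
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    # data[k::width] selects byte k of each value, which is exactly the k-th stream.
    return b"".join(data[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse transform: gather byte k of each value back from stream k."""
    if len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    n = len(encoded) // width  # N, the number of values (and the length of each stream)
    out = bytearray(len(encoded))
    for k in range(width):
        # Stream k occupies bytes [k*n, (k+1)*n) of the encoded buffer.
        out[k::width] = encoded[k * n:(k + 1) * n]
    return bytes(out)


# The three 32-bit values from the Encodings.md example, as raw bytes.
raw = bytes.fromhex("AABBCCDD" "00112233" "A3B4C5D6")
encoded = byte_stream_split_encode(raw, 4)
print(encoded.hex())  # aa00a3bb11b4cc22c5dd33d6
assert byte_stream_split_decode(encoded, 4) == raw
```

The output has the same length as the input, which matches the statement that the encoding does not itself reduce the size of the data; any gain comes from a compressor applied afterwards.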