This is an automated email from the ASF dual-hosted git repository.

shangxinli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
     new 230711f  PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding (#192)
230711f is described below

commit 230711fbfd8d3399cce935a4f39d1be7b6ad5ad5
Author: Gang Wu <ust...@gmail.com>
AuthorDate: Sat Feb 11 06:20:02 2023 +0800

    PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding (#192)
---
 Encodings.md | 6 +++++-
 README.md    | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Encodings.md b/Encodings.md
index a84cb02..a70ae6f 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -319,10 +319,14 @@ This encoding does not reduce the size of the data but can lead to a significant
 compression ratio and speed when a compression algorithm is used afterwards.
 
 This encoding creates K byte-streams of length N where K is the size in bytes of the data
-type and N is the number of elements in the data sequence.
+type and N is the number of elements in the data sequence. Specifically, K is 4 for FLOAT
+type and 8 for DOUBLE type.
 The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
 0-th stream, the 1-st byte goes to the 1-st stream and so on.
 The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
+The total length of encoded streams is K * N bytes. Because it does not have any metadata
+to indicate the total length, the end of the streams is also the end of data page. No padding
+is allowed inside the data page.
 
 Example:
 Original data is three 32-bit floats and for simplicity we look at their raw representation.
diff --git a/README.md b/README.md
index d0f654f..ecacd6e 100644
--- a/README.md
+++ b/README.md
@@ -199,7 +199,7 @@ nothing else.
 ## Data Pages
 
 For data pages, the 3 pieces of information are encoded back to back, after the page
-header.
+header. No padding is allowed in the data page.
 In order we have:
 1. repetition levels data
 1. definition levels data
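The scatter/concatenate scheme described in the Encodings.md hunk above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the Parquet project: the function names `byte_stream_split_encode`/`byte_stream_split_decode` are invented here, and only the behavior stated in the spec text (K streams of length N, streams concatenated in order, total length K * N with no padding) is modeled.

```python
import struct

def byte_stream_split_encode(values, k=4):
    """Scatter each K-byte value into K streams, then concatenate the streams.

    Per the wording in the diff, k=4 corresponds to FLOAT and k=8 to DOUBLE.
    (Function names here are illustrative, not part of any Parquet library.)
    """
    fmt = "<f" if k == 4 else "<d"
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream i holds the i-th byte of every value; output order is stream 0, 1, ...
    streams = [raw[i::k] for i in range(k)]
    return b"".join(streams)

def byte_stream_split_decode(data, k=4):
    """Inverse transform. There is no length metadata: N = len(data) // k."""
    n = len(data) // k
    fmt = "<f" if k == 4 else "<d"
    raw = bytearray(len(data))
    for i in range(k):
        # Put stream i's bytes back at positions i, i+k, i+2k, ...
        raw[i::k] = data[i * n:(i + 1) * n]
    return [struct.unpack_from(fmt, raw, j * k)[0] for j in range(n)]

encoded = byte_stream_split_encode([1.0, 2.0, 3.0])
assert len(encoded) == 4 * 3  # total length is K * N bytes, no padding
assert byte_stream_split_decode(encoded) == [1.0, 2.0, 3.0]
```

As the added spec text notes, the transform itself does not shrink the data (the output is exactly K * N bytes); it only groups similar bytes together so that a subsequent general-purpose compressor can do better.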