This is an automated email from the ASF dual-hosted git repository.

shangxinli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
     new 230711f  PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding (#192)
230711f is described below

commit 230711fbfd8d3399cce935a4f39d1be7b6ad5ad5
Author: Gang Wu <ust...@gmail.com>
AuthorDate: Sat Feb 11 06:20:02 2023 +0800

    PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding (#192)
---
 Encodings.md | 6 +++++-
 README.md    | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Encodings.md b/Encodings.md
index a84cb02..a70ae6f 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -319,10 +319,14 @@ This encoding does not reduce the size of the data but can lead to a significant
 compression ratio and speed when a compression algorithm is used afterwards.
 
 This encoding creates K byte-streams of length N where K is the size in bytes of the data
-type and N is the number of elements in the data sequence.
+type and N is the number of elements in the data sequence. Specifically, K is 4 for FLOAT
+type and 8 for DOUBLE type.
 The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the
 0-th stream, the 1-st byte goes to the 1-st stream and so on.
 The streams are concatenated in the following order: 0-th stream, 1-st stream, etc.
+The total length of encoded streams is K * N bytes. Because it does not have any metadata
+to indicate the total length, the end of the streams is also the end of data page. No padding
+is allowed inside the data page.
 
 Example:
 Original data is three 32-bit floats and for simplicity we look at their raw representation.
diff --git a/README.md b/README.md
index d0f654f..ecacd6e 100644
--- a/README.md
+++ b/README.md
@@ -199,7 +199,7 @@ nothing else.
 ## Data Pages
 
 For data pages, the 3 pieces of information are encoded back to back, after the page
-header.
+header. No padding is allowed in the data page.
 In order we have:
 1. repetition levels data
 1. definition levels data
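The scatter/concatenate scheme described in the Encodings.md hunk above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the Parquet project: the function names `byte_stream_split_encode`/`byte_stream_split_decode` are invented here, and only the behavior stated in the spec text (K streams of length N, streams concatenated in order, total length K * N with no padding) is modeled.

```python
import struct

def byte_stream_split_encode(values, k=4):
    """Scatter each K-byte value into K streams, then concatenate the streams.

    Per the wording in the diff, k=4 corresponds to FLOAT and k=8 to DOUBLE.
    (Function names here are illustrative, not part of any Parquet library.)
    """
    fmt = "<f" if k == 4 else "<d"
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream i holds the i-th byte of every value; output order is stream 0, 1, ...
    streams = [raw[i::k] for i in range(k)]
    return b"".join(streams)

def byte_stream_split_decode(data, k=4):
    """Inverse transform. There is no length metadata: N = len(data) // k."""
    n = len(data) // k
    fmt = "<f" if k == 4 else "<d"
    raw = bytearray(len(data))
    for i in range(k):
        # Put stream i's bytes back at positions i, i+k, i+2k, ...
        raw[i::k] = data[i * n:(i + 1) * n]
    return [struct.unpack_from(fmt, raw, j * k)[0] for j in range(n)]

encoded = byte_stream_split_encode([1.0, 2.0, 3.0])
assert len(encoded) == 4 * 3  # total length is K * N bytes, no padding
assert byte_stream_split_decode(encoded) == [1.0, 2.0, 3.0]
```

As the added spec text notes, the transform itself does not shrink the data (the output is exactly K * N bytes); it only groups similar bytes together so that a subsequent general-purpose compressor can do better.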