(parquet-testing) branch master updated: PARQUET-2414: Add test file for additional BYTE_STREAM_SPLIT types (#46)

apitrou Mon, 18 Mar 2024 03:45:07 -0700

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git



The following commit(s) were added to refs/heads/master by this push:
     new 74278bc  PARQUET-2414: Add test file for additional BYTE_STREAM_SPLIT types (#46)
74278bc is described below

commit 74278bc4a1122d74945969e6dec405abd1533ec3
Author: Antoine Pitrou <[email protected]>
AuthorDate: Mon Mar 18 11:42:46 2024 +0100

    PARQUET-2414: Add test file for additional BYTE_STREAM_SPLIT types (#46)
    
    Add a new data file that allows exercising BYTE_STREAM_SPLIT for all supported types:
    FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY (the latter with several widths and logical types).
    
    For each type, two columns are provided with the same values: one PLAIN-encoded, the other BYTE_STREAM_SPLIT-encoded.
---
 data/README.md                               |  36 +++++++++++++++++++++++++++
 data/byte_stream_split_extended.gzip.parquet | Bin 0 -> 15659 bytes
 2 files changed, 36 insertions(+)

diff --git a/data/README.md b/data/README.md
index c25ee77..f805c8b 100644
--- a/data/README.md
+++ b/data/README.md
@@ -325,6 +325,8 @@ print(m2.row_group(0).column(0))
 
 ## Byte Stream Split
 
+# FLOAT and DOUBLE data
+
 `byte_stream_split.zstd.parquet` is generated by pyarrow 14.0.2 using the following code:
 
 ```python
@@ -351,3 +353,37 @@ pq.write_table(
 
 This is a practical case where `BYTE_STREAM_SPLIT` encoding obtains a smaller file size than `PLAIN` or dictionary.
 Since the distributions are random normals centered at 0, each byte has nontrivial behavior.
+
+# Additional types
+
+`byte_stream_split_extended.gzip.parquet` is generated by pyarrow 16.0.0.
+It contains 7 pairs of columns, each in two variants containing the same
+values: one `PLAIN`-encoded and one `BYTE_STREAM_SPLIT`-encoded:
+```
+Version: 2.6
+Created By: parquet-cpp-arrow version 16.0.0-SNAPSHOT
+Total rows: 200
+Number of RowGroups: 1
+Number of Real Columns: 14
+Number of Columns: 14
+Number of Selected Columns: 14
+Column 0: float16_plain (FIXED_LEN_BYTE_ARRAY(2) / Float16)
+Column 1: float16_byte_stream_split (FIXED_LEN_BYTE_ARRAY(2) / Float16)
+Column 2: float_plain (FLOAT)
+Column 3: float_byte_stream_split (FLOAT)
+Column 4: double_plain (DOUBLE)
+Column 5: double_byte_stream_split (DOUBLE)
+Column 6: int32_plain (INT32)
+Column 7: int32_byte_stream_split (INT32)
+Column 8: int64_plain (INT64)
+Column 9: int64_byte_stream_split (INT64)
+Column 10: flba5_plain (FIXED_LEN_BYTE_ARRAY(5))
+Column 11: flba5_byte_stream_split (FIXED_LEN_BYTE_ARRAY(5))
+Column 12: decimal_plain (FIXED_LEN_BYTE_ARRAY(4) / Decimal(precision=7, scale=3) / DECIMAL(7,3))
+Column 13: decimal_byte_stream_split (FIXED_LEN_BYTE_ARRAY(4) / Decimal(precision=7, scale=3) / DECIMAL(7,3))
+```
+
+To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
+`BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
+the values from the corresponding `PLAIN`-encoded column. The values should
+be equal.
diff --git a/data/byte_stream_split_extended.gzip.parquet b/data/byte_stream_split_extended.gzip.parquet
new file mode 100644
index 0000000..41f286f
Binary files /dev/null and b/data/byte_stream_split_extended.gzip.parquet differ
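
For readers who want to see what the conformance check in the README amounts to at the byte level, here is a minimal sketch in pure Python. It assumes the layout described in the Parquet format documentation: for values of width K bytes, byte k of every value goes into stream k, and the K streams are concatenated. The function names `byte_stream_split_encode` and `byte_stream_split_decode` and the sample values are hypothetical and are not taken from this commit or from any library.

```python
import struct


def byte_stream_split_encode(plain: bytes, width: int) -> bytes:
    """Scatter byte k of each fixed-width value into stream k, then concatenate the streams."""
    if width <= 0 or len(plain) % width != 0:
        raise ValueError("buffer length must be a multiple of a positive value width")
    # plain[k::width] picks byte k of every value, i.e. the k-th stream.
    return b"".join(plain[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of the encoder: interleave the streams back into PLAIN byte order."""
    if width <= 0 or len(encoded) % width != 0:
        raise ValueError("buffer length must be a multiple of a positive value width")
    count = len(encoded) // width  # number of values
    out = bytearray(len(encoded))
    for k in range(width):
        # Stream k occupies bytes [k * count, (k + 1) * count) of the encoded buffer.
        out[k::width] = encoded[k * count:(k + 1) * count]
    return bytes(out)


# Hypothetical sample data: four little-endian 32-bit floats, as in a PLAIN FLOAT page.
values = [1.5, -2.25, 0.0, 3.75]
plain = struct.pack("<4f", *values)

encoded = byte_stream_split_encode(plain, width=4)
decoded = byte_stream_split_decode(encoded, width=4)

# The conformance idea from the README: decoded BYTE_STREAM_SPLIT data must equal the PLAIN data.
assert decoded == plain
```

The same two functions apply unchanged to the other widths listed in the file (2 for Float16, 8 for DOUBLE and INT64, 5 and 4 for the FIXED_LEN_BYTE_ARRAY columns), since the transform only depends on the byte width and not on the logical type. A real reader would additionally handle page headers and gzip decompression, which this sketch leaves out.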
