This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new 4cb3cff Add BYTE_STREAM_SPLIT data file (#45)
4cb3cff is described below
commit 4cb3cff24c965fb329cdae763eabce47395a68a0
Author: Martin <[email protected]>
AuthorDate: Tue Jan 9 09:26:11 2024 -0500
Add BYTE_STREAM_SPLIT data file (#45)
---------
Co-authored-by: Antoine Pitrou <[email protected]>
---
data/README.md | 30 ++++++++++++++++++++++++++++++
data/byte_stream_split.zstd.parquet | Bin 0 -> 4104 bytes
2 files changed, 30 insertions(+)
diff --git a/data/README.md b/data/README.md
index 69c5e94..c25ee77 100644
--- a/data/README.md
+++ b/data/README.md
@@ -49,6 +49,7 @@
| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
+| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
TODO: Document what each file is in the table above.
@@ -321,3 +322,32 @@ print(m2.row_group(0).column(0))
# total_compressed_size: 76
# total_uncompressed_size: 76
```
+
+## Byte Stream Split
+
+`byte_stream_split.zstd.parquet` is generated by pyarrow 14.0.2 using the following code:
+
+```python
+import pyarrow as pa
+from pyarrow import parquet as pq
+import numpy as np
+
+np.random.seed(0)
+table = pa.Table.from_pydict({
+ 'f32': np.random.normal(size=300).astype(np.float32),
+ 'f64': np.random.normal(size=300).astype(np.float64),
+})
+
+pq.write_table(
+ table,
+    'byte_stream_split.zstd.parquet',
+ version='2.6',
+ compression='zstd',
+ compression_level=22,
+ column_encoding='BYTE_STREAM_SPLIT',
+ use_dictionary=False,
+)
+```
+
+This is a practical case where `BYTE_STREAM_SPLIT` encoding obtains a smaller file size than `PLAIN` or dictionary encoding.
+Since the values are random normals centered at 0, each byte of the floating-point representation carries nontrivial information, and grouping the bytes of each position into separate streams exposes redundancy (notably in the sign/exponent bytes) that the interleaved `PLAIN` layout hides from the compressor.
diff --git a/data/byte_stream_split.zstd.parquet
b/data/byte_stream_split.zstd.parquet
new file mode 100644
index 0000000..631d492
Binary files /dev/null and b/data/byte_stream_split.zstd.parquet differ