This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new 4cb3cff Add BYTE_STREAM_SPLIT data file (#45)
4cb3cff is described below
commit 4cb3cff24c965fb329cdae763eabce47395a68a0
Author: Martin <[email protected]>
AuthorDate: Tue Jan 9 09:26:11 2024 -0500
Add BYTE_STREAM_SPLIT data file (#45)
---------
Co-authored-by: Antoine Pitrou <[email protected]>
---
data/README.md | 30 ++++++++++++++++++++++++++++++
data/byte_stream_split.zstd.parquet | Bin 0 -> 4104 bytes
2 files changed, 30 insertions(+)
diff --git a/data/README.md b/data/README.md
index 69c5e94..c25ee77 100644
--- a/data/README.md
+++ b/data/README.md
@@ -49,6 +49,7 @@
| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
+| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
TODO: Document what each file is in the table above.
@@ -321,3 +322,32 @@ print(m2.row_group(0).column(0))
# total_compressed_size: 76
# total_uncompressed_size: 76
```
+
+## Byte Stream Split
+
+`byte_stream_split.zstd.parquet` is generated by pyarrow 14.0.2 using the following code:
+
+```python
+import pyarrow as pa
+from pyarrow import parquet as pq
+import numpy as np
+
+np.random.seed(0)
+table = pa.Table.from_pydict({
+ 'f32': np.random.normal(size=300).astype(np.float32),
+ 'f64': np.random.normal(size=300).astype(np.float64),
+})
+
+pq.write_table(
+ table,
+    'byte_stream_split.zstd.parquet',
+ version='2.6',
+ compression='zstd',
+ compression_level=22,
+ column_encoding='BYTE_STREAM_SPLIT',
+ use_dictionary=False,
+)
+```
+
+This is a practical case where `BYTE_STREAM_SPLIT` encoding obtains a smaller file size than `PLAIN` or dictionary encoding.
+Since the values are random normals centered at 0, each byte of the floating-point representation carries nontrivial information, and grouping the bytes of each position into separate streams exposes redundancy (notably in the sign/exponent bytes) that the interleaved `PLAIN` layout hides from the compressor.
diff --git a/data/byte_stream_split.zstd.parquet
b/data/byte_stream_split.zstd.parquet
new file mode 100644
index 0000000..631d492
Binary files /dev/null and b/data/byte_stream_split.zstd.parquet differ