iemejia opened a new pull request, #55930:
URL: https://github.com/apache/spark/pull/55930

   ### What changes were proposed in this pull request?
   
   Documents and tests that Spark supports writing Parquet files with 
BYTE_STREAM_SPLIT encoding for FLOAT and DOUBLE columns. This encoding 
de-interleaves value bytes into per-position streams, making each stream highly 
compressible -- float/double columns typically see 2-4x better compression than 
PLAIN+zstd for time-series and scientific data.
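
   To make the de-interleaving concrete, here is a minimal pure-Python sketch of the byte layout. It is illustrative only and is not the parquet-mr encoder; the helper names `bss_encode` and `bss_decode` are hypothetical.

```python
import struct


def bss_encode(values, fmt="f"):
    """Split little-endian floats into per-byte-position streams.

    fmt is a struct code: "f" for 4-byte FLOAT, "d" for 8-byte DOUBLE.
    Stream k holds byte k of every value; the streams are concatenated.
    """
    width = struct.calcsize("<" + fmt)
    # PLAIN-style layout: values back to back, little-endian.
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    # raw[k::width] selects byte k of each value.
    return b"".join(raw[k::width] for k in range(width))


def bss_decode(data, fmt="f"):
    """Inverse of bss_encode: re-interleave the streams and unpack."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width
    streams = [data[k * count:(k + 1) * count] for k in range(width)]
    # zip(*streams) yields, for each value, its bytes in original order.
    raw = bytes(b for value_bytes in zip(*streams) for b in value_bytes)
    return list(struct.unpack(f"<{count}{fmt}", raw))
```

   For `[1.0, 2.5, -3.25]` as 4-byte floats, the two low-order byte streams are all zeros and the sign/exponent bytes end up adjacent, which is the kind of regularity a general-purpose compressor can exploit.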
   
   No Spark code changes are needed because parquet-mr (1.17.0) already 
includes the BYTE_STREAM_SPLIT encoder, and Spark's existing configuration 
passthrough mechanism (`DataSourceUtils.mergeWriteOptionsIntoHadoopConf`) 
already forwards arbitrary `parquet.*` properties to the writer. Setting 
`parquet.enable.bytestreamsplit=true` (with dictionary disabled) activates BSS 
encoding for FLOAT and DOUBLE columns.
   
   This PR adds a test to `ParquetEncodingSuite` that:
   1. Writes 8193 rows with INT32, INT64, FLOAT, DOUBLE, and nullable 
FLOAT/DOUBLE columns using BSS encoding
   2. Verifies via Parquet metadata that FLOAT/DOUBLE columns use 
`BYTE_STREAM_SPLIT` encoding while INT32/INT64 columns do not (the boolean flag 
only enables the `FLOATING_POINT` mode)
   3. Reads data back and verifies round-trip correctness including null 
handling
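
   The metadata check in step 2 can be sketched outside the Scala suite as follows. This is an assumption-laden illustration, not the test added by this PR: it assumes `pyarrow` is available for reading the footer, and `check_bss_encodings` / `columns_from_file` are hypothetical helper names.

```python
FLOATING_TYPES = {"FLOAT", "DOUBLE"}


def check_bss_encodings(columns):
    """Return a list of problems for (name, physical_type, encodings) tuples.

    FLOAT/DOUBLE columns are expected to list BYTE_STREAM_SPLIT;
    all other physical types are expected not to.
    """
    problems = []
    for name, physical_type, encodings in columns:
        has_bss = "BYTE_STREAM_SPLIT" in encodings
        if physical_type in FLOATING_TYPES and not has_bss:
            problems.append(f"{name}: expected BYTE_STREAM_SPLIT, got {sorted(encodings)}")
        elif physical_type not in FLOATING_TYPES and has_bss:
            problems.append(f"{name}: unexpected BYTE_STREAM_SPLIT on {physical_type}")
    return problems


def columns_from_file(path):
    """Yield (name, physical_type, encodings) for every column chunk in a file."""
    # Imported lazily so the checker above works without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            col = row_group.column(ci)
            yield col.path_in_schema, col.physical_type, col.encodings
```

   Given one of the part files Spark wrote, `check_bss_encodings(columns_from_file(path))` should return an empty list if the flag took effect.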
   
   Users can enable BSS encoding via any of:
   - `.option("parquet.enable.bytestreamsplit", "true")` on `DataFrameWriter`
   - `withSQLConf("parquet.enable.bytestreamsplit" -> "true")`
   - `spark.hadoop.parquet.enable.bytestreamsplit=true` in SparkConf
   
   Dictionary encoding must be disabled for BSS to take effect (BSS replaces 
the fallback PLAIN encoding, not dictionary encoding).
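
   As a rough PySpark sketch of the `DataFrameWriter` route: the BSS key is the one named in this description, while `parquet.enable.dictionary` is assumed here as the parquet-mr property for turning dictionary encoding off. The helper names are hypothetical.

```python
def bss_write_options():
    """Writer options for BSS on FLOAT/DOUBLE columns, as plain strings."""
    return {
        # Key as given in this PR description.
        "parquet.enable.bytestreamsplit": "true",
        # Assumption: dictionary encoding is disabled via this parquet-mr property.
        "parquet.enable.dictionary": "false",
    }


def write_with_bss(df, path):
    """Write a PySpark DataFrame to `path` with the options above."""
    # DataFrameWriter.options accepts keyword arguments, so the dict is unpacked.
    df.write.options(**bss_write_options()).parquet(path)
```

   Calling `write_with_bss(df, "/tmp/bss_demo")` on an existing DataFrame (the path is a placeholder) passes both properties to the Parquet writer for that write only, leaving session-wide settings untouched.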
   
   ### Why are the changes needed?
   
   BYTE_STREAM_SPLIT encoding is particularly effective for floating-point data 
in time-series, scientific, and IoT workloads. The encoding is already 
supported by parquet-mr and already works through Spark's config passthrough, 
but this was undocumented and untested. This PR adds test coverage to prevent 
regressions and documents the capability.
   
   The read-side vectorized decoder for BYTE_STREAM_SPLIT was added in 
SPARK-56894 (PR #55921).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The capability already works through the existing config passthrough. 
This PR only adds test coverage.
   
   ### How was this patch tested?
   
   New test case in `ParquetEncodingSuite`: "BYTE_STREAM_SPLIT encoding for 
float and double columns". It covers the write path, metadata verification, and 
round-trip read correctness with nullable columns.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode with Claude Opus (claude-opus-4.6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

