[PR] [SPARK-57383][SQL][PYTHON] Honor configured Arrow zstd compression level when writing Arrow batches [spark]

via GitHub Wed, 10 Jun 2026 18:55:04 -0700


viirya opened a new pull request, #56444:
URL: https://github.com/apache/spark/pull/56444


   ### What changes were proposed in this pull request?
   
   This PR fixes a bug where the zstd compression level configured via 
`spark.sql.execution.arrow.compression.zstd.level` was silently ignored 
everywhere Arrow batches are compressed. Three places shared the same broken 
pattern:
   
   - `ArrowConverters.ArrowBatchIterator` (SPARK-54134)
   - `PythonArrowInput` (SPARK-54226; also covers `GroupedPythonArrowInput`, 
which reuses this codec via SPARK-55328)
   - `CoGroupedArrowPythonRunner` (SPARK-54226)
   
   Each site constructed `new ZstdCompressionCodec(level)` only to read its 
codec type, then rebuilt the codec through 
`CompressionCodec.Factory.INSTANCE.createCodec(codecType)`. The codec type enum 
does not carry a level, so that single-argument factory overload always builds 
a codec at the zstd default level (3), dropping the configured one.
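
   The shape of the bug can be modeled in a few lines. This is a minimal 
sketch, not the Spark or Arrow code: `CodecType`, `StandInZstdCodec`, 
`create_codec_by_type`, `broken_construction` and `fixed_construction` are 
hypothetical names, and `zlib` stands in for zstd so the example runs with the 
Python standard library alone.

```python
import enum
import zlib

DEFAULT_LEVEL = 3  # stand-in for the zstd default level mentioned above


class CodecType(enum.Enum):
    """Identifies the algorithm only; there is no field for a level."""
    NONE = 0
    ZSTD = 1


class StandInZstdCodec:
    """Hypothetical model of a level-carrying codec (zlib stands in for zstd)."""

    def __init__(self, level=DEFAULT_LEVEL):
        self.level = level

    def codec_type(self):
        return CodecType.ZSTD

    def compress(self, data):
        return zlib.compress(data, self.level)


def create_codec_by_type(codec_type):
    """Model of a factory lookup that only receives the codec type."""
    if codec_type is CodecType.ZSTD:
        return StandInZstdCodec()  # no level is available here, so default
    raise ValueError(f"unsupported codec type: {codec_type}")


def broken_construction(level):
    # The configured level is discarded as soon as only the type is kept.
    codec_type = StandInZstdCodec(level).codec_type()
    return create_codec_by_type(codec_type)


def fixed_construction(level):
    # Keep the level-carrying instance instead of rebuilding it by type.
    return StandInZstdCodec(level)
```

   In this model `broken_construction(9).level` is always `DEFAULT_LEVEL`, 
while `fixed_construction(9).level` is the requested level.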
   
   The codec construction is extracted into a shared 
`ArrowCompressionUtils.createCompressionCodec` helper that constructs the 
level-carrying codec instance directly (the helper lives in `sql/core` because 
`sql/api`, where `ArrowUtils` is, has no `arrow-compression` dependency). The 
level only matters on the write side; the read side looks up the codec by the 
type recorded in the IPC message, so reads are unaffected and the on-wire 
format is unchanged.
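
   The write-only nature of the level can be illustrated with a small sketch. 
It assumes `zlib` as a stand-in for zstd, and `write_batch` and `read_batch` 
are hypothetical helpers, not Arrow APIs. The point it shows is only that the 
decoder never takes a level, so payloads written at different levels are read 
back the same way.

```python
import zlib

payload = b"arrow batch stand-in " * 200


def write_batch(data, level):
    """The writer chooses a level; only compressed bytes leave the writer."""
    return zlib.compress(data, level)


def read_batch(wire_bytes):
    """The reader takes no level argument at all."""
    return zlib.decompress(wire_bytes)


low = write_batch(payload, 1)
high = write_batch(payload, 9)

# Both payloads decode to the original bytes through the same reader.
roundtrip_ok = read_batch(low) == payload and read_batch(high) == payload
```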
   
   The same bug class was found by @dbtsai during review of #56334 
(https://github.com/apache/spark/pull/56334#discussion_r3391654988); that PR 
fixes the cache-side instance of the pattern, and this PR fixes the remaining 
three pre-existing instances.
   
   ### Why are the changes needed?
   
   Users tuning `spark.sql.execution.arrow.compression.zstd.level` for Python 
UDF exchange or `df.toArrow()` saw no effect: whatever level they set, batches 
were compressed at the default level 3, with no error or warning.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, though only in unreleased versions (the affected code is in 4.2.0 RCs 
and master, not in any released Spark). The configured zstd level now actually 
takes effect; previously all levels behaved like the default level 3.
   
   ### How was this patch tested?
   
   New `ArrowCompressionUtilsSuite`. The regression test compresses the same 
compressible-but-varying batch at zstd level -5 and level 19 and asserts level 
19 produces a strictly smaller payload. Against the old codec construction this 
test fails with byte-identical sizes at both levels (verified locally). A 
second test covers the `none` codec and the unsupported-codec error.
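
   The idea behind the regression test can be sketched as follows. This is an 
illustrative model rather than the suite itself: `zlib` levels 1 and 9 stand in 
for zstd levels -5 and 19, and `make_payload`, `compressed_size` and 
`size_ignoring_level` are hypothetical helpers.

```python
import random
import zlib


def make_payload(seed=0, n_words=20000):
    """Compressible but varying data: random words from a small vocabulary."""
    rng = random.Random(seed)
    vocab = [f"value{i}" for i in range(64)]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode()


def compressed_size(data, level):
    """Size when the requested level is honored."""
    return len(zlib.compress(data, level))


def size_ignoring_level(data, level):
    """Model of the old behavior: the level argument is dropped."""
    return len(zlib.compress(data, 3))


payload = make_payload()
fast_size = compressed_size(payload, 1)
best_size = compressed_size(payload, 9)

# With the level ignored, both "levels" yield the same size, so an
# assertion of the form high < low fails, which is what flags the bug.
old_fast = size_ignoring_level(payload, 1)
old_best = size_ignoring_level(payload, 9)
```

   Asserting a strict inequality rather than a specific size keeps such a test 
independent of the exact compressor version.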
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
