[PR] [SPARK-57383][SQL][PYTHON][4.1] Honor configured Arrow zstd compression level when writing Arrow batches [spark]

via GitHub Thu, 11 Jun 2026 09:04:44 -0700


viirya opened a new pull request, #56451:
URL: https://github.com/apache/spark/pull/56451


   ### What changes were proposed in this pull request?
   
   Backport of #56444 (commit e33017a9c62) to branch-4.1.
   
   This fixes a bug where the zstd compression level configured via 
`spark.sql.execution.arrow.compression.zstd.level` was silently ignored 
everywhere Arrow batches are compressed. The affected code constructed `new 
ZstdCompressionCodec(level)` only to read its codec type, then rebuilt the 
codec through `CompressionCodec.Factory.INSTANCE.createCodec(codecType)`. The 
codec type enum does not carry a level, so that single-argument factory 
overload always builds a codec at the zstd default level (3), dropping the 
configured one.
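
   The round trip described above can be modelled in a few lines. This is a toy sketch, not Arrow's API: `CodecType`, `ZstdCodec`, `create_codec_from_type`, `broken_codec`, and `fixed_codec` are hypothetical names that only mirror the shape of the description (a level-carrying codec, a level-free type enum, and a factory that sees only the type).

```python
from dataclasses import dataclass
from enum import Enum

ZSTD_DEFAULT_LEVEL = 3  # default level named in the description above


class CodecType(Enum):
    """Identifies the algorithm only; there is no slot for a level."""
    ZSTD = "zstd"


@dataclass(frozen=True)
class ZstdCodec:
    """Hypothetical stand-in for a level-carrying codec instance."""
    level: int = ZSTD_DEFAULT_LEVEL

    @property
    def codec_type(self) -> CodecType:
        return CodecType.ZSTD


def create_codec_from_type(codec_type: CodecType) -> ZstdCodec:
    """Hypothetical single-argument factory: it receives only the type,
    so the only codec it can build is one at the default level."""
    if codec_type is CodecType.ZSTD:
        return ZstdCodec()
    raise ValueError(f"unsupported codec type: {codec_type}")


def broken_codec(level: int) -> ZstdCodec:
    # Old pattern: build a codec, keep only its type, rebuild from the type.
    return create_codec_from_type(ZstdCodec(level).codec_type)


def fixed_codec(level: int) -> ZstdCodec:
    # Fixed pattern: hand the level-carrying instance to the writer directly.
    return ZstdCodec(level)
```

   In this model `broken_codec(19)` yields a codec at the default level, while `fixed_codec(19)` keeps the requested level.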
   
   Compared to the master patch, this backport fixes one additional copy of the 
broken pattern: on branch-4.1 `GroupedPythonArrowInput` still has its own codec 
construction in `createUnloaderForGroup` (the SPARK-55328 deduplication that 
makes it reuse `PythonArrowInput.codec` is master/4.2-only). The four fixed 
sites are:
   
   - `ArrowConverters.ArrowBatchIterator`
   - `PythonArrowInput`
   - `GroupedPythonArrowInput` (branch-4.1 only; folded into `PythonArrowInput` 
on master)
   - `CoGroupedArrowPythonRunner`
   
   All four now use the shared `ArrowCompressionUtils.createCompressionCodec` 
helper, which constructs the level-carrying codec instance directly. The level 
only matters on the write side; the read side looks up the codec by the type 
recorded in the IPC message, so reads are unaffected and the on-wire format is 
unchanged.
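
   The write-side-only nature of the level can be illustrated with any level-based compressor. The sketch below uses `zlib` from the Python standard library purely as a stand-in for zstd (an assumption made for self-containment; it does not touch Arrow or Spark): the level is an argument to compression only, and decompression needs no knowledge of it.

```python
import zlib

# A small, repetitive payload; the exact contents are arbitrary.
payload = b"".join(b"id=%d;bucket=%d\n" % (i, i % 17) for i in range(5000))

# Write side: the level is chosen here and nowhere else.
written_fast = zlib.compress(payload, 1)
written_small = zlib.compress(payload, 9)

# Read side: the same call decodes both streams; no level is passed.
restored_fast = zlib.decompress(written_fast)
restored_small = zlib.decompress(written_small)
```

   Both streams decode to the original bytes, which is the property that lets a writer change its level without readers noticing.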
   
   ### Why are the changes needed?
   
   Users tuning `spark.sql.execution.arrow.compression.zstd.level` for Python 
UDF exchange or `df.toArrow()` got no effect at all: every level compressed 
identically at the default level 3, with no error or warning. The released 
versions 4.1.0, 4.1.1, and 4.1.2 are affected.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The configured zstd level now actually takes effect; previously all 
levels behaved like the default level 3.
   
   ### How was this patch tested?
   
   `ArrowCompressionUtilsSuite`, brought over with the backport. The regression 
test compresses the same compressible-but-varying batch at zstd level -5 and 
level 19 and asserts level 19 produces a strictly smaller payload; against the 
old codec construction it fails with byte-identical sizes at both levels. 
Verified the suite passes on this branch.
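
   The shape of that regression check can be sketched as follows. This is not the code in `ArrowCompressionUtilsSuite`; it is an illustrative analogue that swaps in stdlib `zlib` (levels 1 and 9) for zstd (levels -5 and 19), and `make_batch`, `compress_fixed`, `compress_broken`, and `level_is_honored` are hypothetical helpers.

```python
import random
import zlib

ZLIB_DEFAULT_LEVEL = 6  # zlib's default, standing in for zstd's default


def make_batch(rows: int = 20000, seed: int = 0) -> bytes:
    """Compressible-but-varying data: a seeded random stream of a few words."""
    rng = random.Random(seed)
    words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta"]
    return b" ".join(rng.choice(words) for _ in range(rows))


def compress_fixed(data: bytes, level: int) -> bytes:
    # The configured level reaches the compressor.
    return zlib.compress(data, level)


def compress_broken(data: bytes, level: int) -> bytes:
    # Models the old construction: the level argument is dropped.
    return zlib.compress(data, ZLIB_DEFAULT_LEVEL)


def level_is_honored(compress) -> bool:
    """True when the high level yields a strictly smaller payload."""
    batch = make_batch()
    low = compress(batch, 1)
    high = compress(batch, 9)
    return len(high) < len(low)
```

   A strict size inequality is a convenient assertion here because a dropped level produces byte-identical output at both settings, so the check cannot pass by accident.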
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


