viirya opened a new pull request, #56451: URL: https://github.com/apache/spark/pull/56451
### What changes were proposed in this pull request?

Backport of #56444 (commit e33017a9c62) to branch-4.1.

This fixes a bug where the zstd compression level configured via `spark.sql.execution.arrow.compression.zstd.level` was silently ignored everywhere Arrow batches are compressed. The affected code constructed `new ZstdCompressionCodec(level)` only to read its codec type, then rebuilt the codec through `CompressionCodec.Factory.INSTANCE.createCodec(codecType)`. The codec type enum does not carry a level, so that single-argument factory overload always builds a codec at the zstd default level (3), dropping the configured one.

Compared to the master patch, this backport fixes one additional copy of the broken pattern: on branch-4.1 `GroupedPythonArrowInput` still has its own codec construction in `createUnloaderForGroup` (the SPARK-55328 deduplication that makes it reuse `PythonArrowInput.codec` is master/4.2-only).

The four fixed sites are:

- `ArrowConverters.ArrowBatchIterator`
- `PythonArrowInput`
- `GroupedPythonArrowInput` (branch-4.1 only; folded into `PythonArrowInput` on master)
- `CoGroupedArrowPythonRunner`

All four now use the shared `ArrowCompressionUtils.createCompressionCodec` helper, which constructs the level-carrying codec instance directly.

The level only matters on the write side; the read side looks up the codec by the type recorded in the IPC message, so reads are unaffected and the on-wire format is unchanged.

### Why are the changes needed?

Users tuning `spark.sql.execution.arrow.compression.zstd.level` for Python UDF exchange or `df.toArrow()` got no effect at all: every level compressed identically at the default level 3, with no error or warning. Released 4.1.0/4.1.1/4.1.2 are affected.

### Does this PR introduce _any_ user-facing change?

Yes. The configured zstd level now actually takes effect; previously all levels behaved like the default level 3.

### How was this patch tested?

`ArrowCompressionUtilsSuite`, brought over with the backport. The regression test compresses the same compressible-but-varying batch at zstd level -5 and level 19 and asserts level 19 produces a strictly smaller payload; against the old codec construction it fails with byte-identical sizes at both levels. Verified the suite passes on this branch.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
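
For readers who want to see the failure mode in isolation, here is a minimal sketch of the pattern this PR describes. It is not code from the patch and it does not use Arrow or zstd: `CodecLevelSketch`, `CodecType`, `LevelCodec`, `createCodec`, `brokenConstruction` and `fixedConstruction` are hypothetical stand-ins, and `java.util.zip.Deflater` replaces zstd so that the example runs on a plain JDK. The only point it tries to illustrate is that rebuilding a codec from its type alone loses the level, and that a size comparison at two levels detects this.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative model only: none of these types exist in Spark or Arrow.
final class CodecLevelSketch {

    // Stand-in for a codec type enum: it names the algorithm, not a level.
    enum CodecType { DEFLATE }

    // Stand-in for a level-carrying codec instance.
    static final class LevelCodec {
        static final int DEFAULT_LEVEL = Deflater.DEFAULT_COMPRESSION;
        final int level;

        LevelCodec(int level) {
            this.level = level;
        }

        CodecType type() {
            return CodecType.DEFLATE;
        }

        // Compresses the input at this codec's level and returns the byte count.
        int compressedSize(byte[] input) {
            Deflater deflater = new Deflater(level);
            try {
                deflater.setInput(input);
                deflater.finish();
                byte[] buffer = new byte[8192];
                int total = 0;
                while (!deflater.finished()) {
                    total += deflater.deflate(buffer);
                }
                return total;
            } finally {
                deflater.end();
            }
        }
    }

    // Stand-in for a type-only factory: with no level argument it can only
    // return a codec at the default level.
    static LevelCodec createCodec(CodecType type) {
        return new LevelCodec(LevelCodec.DEFAULT_LEVEL);
    }

    // The broken pattern: build a codec with the level, keep only its type,
    // then rebuild from the type. The configured level is dropped.
    static LevelCodec brokenConstruction(int level) {
        CodecType type = new LevelCodec(level).type();
        return createCodec(type);
    }

    // The fixed pattern: use the level-carrying instance directly.
    static LevelCodec fixedConstruction(int level) {
        return new LevelCodec(level);
    }

    // Deterministic, compressible-but-varying sample data.
    static byte[] sampleBatch() {
        StringBuilder sb = new StringBuilder();
        long state = 42L;
        for (int i = 0; i < 20_000; i++) {
            state = state * 6364136223846793005L + 1442695040888963407L;
            long reading = (state >>> 33) % 100_000;
            sb.append("row=").append(i)
              .append(",sensor=").append(reading % 37)
              .append(",reading=").append(reading)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] batch = sampleBatch();
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            System.out.printf("level %d: broken=%d bytes, fixed=%d bytes%n",
                level,
                brokenConstruction(level).compressedSize(batch),
                fixedConstruction(level).compressedSize(batch));
        }
    }
}
```

In this model the broken path reports the same size at both levels, while the fixed path should report a smaller size at the higher level, which mirrors the shape of the regression check described under "How was this patch tested?".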
