varun0630 opened a new pull request, #779: URL: https://github.com/apache/arrow-go/pull/779
## Rationale

The `klauspost/compress/zstd` encoder currently disables `AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to zstd levels 1–4). When no LZ matches are found, Klauspost's encoder short-circuits to storing literals uncompressed, skipping the entropy-coding stage. This is a good trade-off for genuinely incompressible data (random bytes), but it leaves significant compression on the table for real-world columnar data where LZ match density is low but byte distributions are highly skewed — e.g. parquet INT32 decimal columns whose values cluster in a small range (so the high bytes are mostly zero). Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on literals even without LZ matches, matching the behavior of the C reference implementation (`facebook/zstd`) at the same nominal levels.

## Impact

Measured on a real-world parquet workload — TPC-DS `store_sales`, 7 Trino-written files, ~9.5M rows, 23 columns including high-cardinality `Decimal(7,2)` columns — going through Apache Iceberg's compaction path at ZSTD level 3:

| Config | Output vs input |
|---|---|
| klauspost (current default) | +6.11% inflation |
| **klauspost + `WithAllLitEntropyCompression(true)`** | **-1.84% reduction** |
| DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% reduction |
| Trino (JNI, C zstd level 3) — reference | -3.99% reduction |

Per-blob benchmark (161 page blobs compressed directly by both implementations at level 3):

- klauspost current default: 346,287 KB (66.60% of raw)
- klauspost + this fix: 329,249 KB (63.32% of raw)
- DataDog/zstd: 329,648 KB (63.40% of raw)

With this one-line change, klauspost matches (and slightly beats) the C reference implementation on this workload.

Upstream discussion with Klaus Post (https://github.com/klauspost/compress) confirmed that enabling `AllLitEntropyCompression` is the intended way to close this gap. This PR applies that setting to arrow-go's zstd codec.
## Trade-off

Slightly slower compression on genuinely incompressible data (the case `AllLitEntropyCompression` was disabled for). For parquet workloads this is typically a non-issue, since columns with no structure are rare.
