varun0630 opened a new pull request, #779:
URL: https://github.com/apache/arrow-go/pull/779

   ## Rationale
   
   The `klauspost/compress/zstd` encoder currently disables 
`AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to zstd 
levels 1–4). Klauspost's encoder short-circuits to storing literals 
uncompressed when no LZ matches are found, skipping the entropy-coding stage. 
This is a good tradeoff for genuinely incompressible data (random bytes), but 
it leaves significant compression on the table for real-world columnar data 
where LZ match density is low but byte distributions are highly skewed — e.g. 
parquet INT32 decimal columns whose values cluster in a small range (so the 
high bytes are mostly zero).
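To make the byte-skew claim concrete, here is a small stdlib-only sketch (synthetic values standing in for a `Decimal(7,2)`-style column, not the benchmark data) that counts how often each byte position of a little-endian INT32 is zero:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// zeroByteCounts encodes n synthetic Decimal(7,2)-style values (cents in
// the range 0-99999, i.e. up to $999.99) as little-endian INT32 and counts,
// per byte position, how often that byte is zero.
func zeroByteCounts(n int) [4]int {
	var zeros [4]int
	buf := make([]byte, 4)
	for i := 0; i < n; i++ {
		v := uint32(i*7919) % 100000 // pseudo-random spread over the range
		binary.LittleEndian.PutUint32(buf, v)
		for pos, b := range buf {
			if b == 0 {
				zeros[pos]++
			}
		}
	}
	return zeros
}

func main() {
	n := 10000
	for pos, z := range zeroByteCounts(n) {
		fmt.Printf("byte %d: %4.1f%% zero\n", pos, 100*float64(z)/float64(n))
	}
}
```

The low bytes vary, but bytes 2 and 3 are mostly or entirely zero: almost no LZ matches, yet a heavily skewed byte distribution that entropy coding can exploit.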
   
   Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on 
literals even without LZ matches, matching the behavior of the C reference 
implementation (`facebook/zstd`) at the same nominal levels.
   
   ## Impact
   
   Measured on a real-world parquet workload — TPC-DS `store_sales`, 7 
Trino-written files, ~9.5M rows, 23 columns including high-cardinality 
`Decimal(7,2)` columns — going through Apache Iceberg's compaction path at ZSTD 
level 3:
   
   | Config | Output size vs. input |
   |---|---|
   | klauspost (current default) | +6.11% (inflation) |
   | **klauspost + `WithAllLitEntropyCompression(true)`** | **-1.84%** |
   | DataDog/zstd (CGo wrapper around C zstd), level 3 | -2.23% |
   | Trino (JNI, C zstd level 3), reference | -3.99% |
   
   Per-blob benchmark (161 page blobs compressed directly by both 
implementations at level 3):
   - klauspost current default: 346,287 KB (66.60% of raw)
   - klauspost + this fix: 329,249 KB (63.32% of raw)
   - DataDog/zstd: 329,648 KB (63.40% of raw)
   
   With this one-line change, klauspost matches (and slightly beats) the C 
reference implementation on this workload.
   
   This was discussed upstream with Klaus Post
(https://github.com/klauspost/compress), who confirmed that enabling
`AllLitEntropyCompression` is the intended way to close this gap. This PR
applies that setting to arrow-go's zstd codec.
   
   ## Trade-off
   
   Compression is slightly slower on genuinely incompressible data (the case
`AllLitEntropyCompression` was originally disabled for): the encoder runs the
entropy pass and then discards it when it does not gain. For parquet workloads
this is typically a non-issue, since columns with no byte-level structure are
rare.

