[PR] GH-3530: Bypass Hadoop codec abstraction to optimize compression performance [parquet-java]

via GitHub Sun, 17 May 2026 15:39:08 -0700


iemejia opened a new pull request, #3570:
URL: https://github.com/apache/parquet-java/pull/3570


   Part of #3530 — Apache Parquet Java Performance Improvements
   
   ## Summary
   
   Bypass the Hadoop `CompressionCodec` abstraction for all six supported 
codecs, eliminating per-page codec-pool lookups, stream-wrapper allocation, and 
unnecessary buffer copies in both `CodecFactory` and `DirectCodecFactory`.
   
   | Codec | Before | After |
   |-------|--------|-------|
   | **Snappy** | Hadoop `SnappyCodec` stream wrappers | xerial `Snappy.compress`/`uncompress` direct calls |
   | **LZ4_RAW** | Hadoop codec abstraction | airlift `LZ4Compressor`/`LZ4Decompressor` direct |
   | **ZSTD** | Streaming `ZstdOutputStreamNoFinalizer`/`ZstdInputStreamNoFinalizer` | Reusable `ZstdCompressCtx`/`ZstdDecompressCtx` single-call APIs |
   | **GZIP** | Hadoop `GzipCodec` with codec-pool overhead | JDK `GZIPOutputStream`/`GZIPInputStream` direct |
   | **LZO** | GPL `com.hadoop.compression.lzo.LzoCodec` | aircompressor `LzoHadoopStreams` (Apache 2.0, wire-compatible) |
   | **Brotli** | Abandoned `brotli-codec` (jbrotli, 2016, x86-only) | `brotli4j` 1.23.0 (10 platforms incl. aarch64, reflection-loaded) |
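
To illustrate the GZIP row, here is a minimal sketch of page-level compression that uses the JDK streams directly, with no codec pool or Hadoop wrapper in between. `GzipDirectSketch` and its method names are hypothetical and are not classes from this PR; the actual implementation in `CodecFactory` may manage buffers differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not a class from parquet-java.
final class GzipDirectSketch {

  // Compress one page with the JDK stream directly.
  static byte[] compress(byte[] page) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream(Math.max(64, page.length / 2));
    try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
      gzip.write(page);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The stream is closed at this point, so the GZIP trailer has been written.
    return sink.toByteArray();
  }

  // Decompress one page; the caller passes the expected uncompressed size.
  static byte[] decompress(byte[] compressed, int uncompressedSize) {
    try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] page = gzip.readNBytes(uncompressedSize);
      if (page.length != uncompressedSize) {
        throw new IllegalStateException(
            "Expected " + uncompressedSize + " bytes, got " + page.length);
      }
      return page;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] page = "parquet page bytes, parquet page bytes".getBytes(StandardCharsets.UTF_8);
    byte[] roundTripped = decompress(compress(page), page.length);
    System.out.println(Arrays.equals(page, roundTripped));
  }
}
```

The sketch allocates a fresh stream per page for brevity; it only shows the call path, not the buffer reuse a production codec would want.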
   
   Notable side effects:
   - **LZO**: Removes GPL dependency; uses Apache 2.0 aircompressor. 
Wire-compatible framing.
   - **Brotli**: Enables aarch64 support (linux, macOS, Windows). Removes 
non-aarch64 Maven profile guards and test skips.
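
For readers unfamiliar with reflection-loaded optional dependencies, the following is a rough sketch of the general pattern: probe for a class by name and only call into it when it is on the classpath. `OptionalCodecLoaderSketch` is hypothetical, and the brotli4j class name in the comment is an assumption about that library's layout; how this PR actually loads brotli4j is not shown here.

```java
import java.lang.reflect.Method;

// Hypothetical helper for illustration only; not a class from parquet-java.
final class OptionalCodecLoaderSketch {

  // Returns true if the named class can be found, without initializing it.
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, OptionalCodecLoaderSketch.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  // Invokes a public static no-argument method by name and returns its result.
  static Object invokeStaticNoArg(String className, String methodName) {
    try {
      Class<?> type = Class.forName(className);
      Method method = type.getMethod(methodName);
      return method.invoke(null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot call " + className + "." + methodName, e);
    }
  }

  public static void main(String[] args) {
    // Assumption: brotli4j exposes its loader under this name.
    String brotliLoader = "com.aayushatharva.brotli4j.Brotli4jLoader";
    System.out.println("brotli4j on classpath: " + isPresent(brotliLoader));
    // A JDK class is used here so the sketch runs without extra dependencies.
    System.out.println(invokeStaticNoArg("java.lang.System", "lineSeparator").equals("\n"));
  }
}
```

The benefit of this pattern is that the codec library stays an optional runtime dependency: code that never touches Brotli does not fail when the library is absent.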
   
   JMH benchmarks: `CompressionBenchmark`, `CpuReadBenchmark`, 
`CpuWriteBenchmark`, `FileReadBenchmark`, `FileWriteBenchmark`, 
`ConcurrentReadWriteBenchmark`.
   
   ## Benchmark results
   
   **Environment**: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, 
Linux x86_64.
   
   **End-to-end file write** (100K rows, SingleShotTime; ms/op, lower is better):
   
   | Codec | V1 dict=true (speedup) | V2 dict=true | V2 speedup |
   |---|---|---|---:|
   | SNAPPY | 50.6 -> 40.9 (1.24x) | 69.7 -> 38.7 | **1.80x** |
   | ZSTD | 52.3 -> 43.6 (1.20x) | 70.7 -> 40.6 | **1.74x** |
   | LZ4_RAW | 49.6 -> 41.3 (1.20x) | 70.2 -> 39.0 | **1.80x** |
   | GZIP | 149.9 -> 119.3 (1.26x) | 123.4 -> 67.6 | **1.83x** |
   | BROTLI | 55.4 -> 46.8 (1.18x) | 72.8 -> 41.8 | **1.74x** |
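
The speedup figures in the table are the ratio of the "before" time to the "after" time. A small sketch that recomputes a few of them from the ms/op values above (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class SpeedupSketch {

  // Speedup is the old time divided by the new time; values above 1 mean faster.
  static double speedup(double beforeMs, double afterMs) {
    return beforeMs / afterMs;
  }

  // Formats with two decimals and a fixed locale so the output uses a decimal point.
  static String format(double speedup) {
    return String.format(Locale.ROOT, "%.2fx", speedup);
  }

  public static void main(String[] args) {
    // V2 dict=true write times taken from the table above.
    System.out.println("SNAPPY V2: " + format(speedup(69.7, 38.7)));
    System.out.println("ZSTD   V2: " + format(speedup(70.7, 40.6)));
    System.out.println("GZIP   V2: " + format(speedup(123.4, 67.6)));
  }
}
```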
   
   **End-to-end file read** (speedup; based on ms/op, lower is better):
   
   | Codec | V1 Speedup | V2 Speedup |
   |---|---:|---:|
   | SNAPPY | **1.50x** | **1.61x** |
   | ZSTD | **1.49x** | **1.60x** |
   | LZ4_RAW | **1.23x** | **1.57x** |
   | GZIP | **1.47x** | **1.49x** |
   | BROTLI | **1.83x** | **1.91x** |
   
   **Raw codec throughput** (`DirectCodecFactory`): Snappy, ZSTD, LZ4_RAW, and 
GZIP are unchanged (this path already had native access). Brotli decompression 
improved **2.3-2.7x** (brotli4j is much faster than jbrotli).
   
   V2 shows consistently larger speedups than V1 because V2 encoding produces 
more, smaller pages. That means more codec invocations per file, so the fixed 
per-invocation Hadoop overhead adds up to a larger share of the total time.
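
The reasoning above can be modeled roughly as a fixed cost per codec call plus a cost proportional to the bytes processed. The sketch below uses made-up numbers purely for illustration (they are not measurements from this PR) to show that splitting the same data into more pages raises the share of time spent on fixed overhead.

```java
// Hypothetical cost model for illustration only; all constants are invented.
final class PerCallOverheadSketch {

  // Fraction of total time spent on fixed per-call overhead.
  static double overheadShare(long totalBytes, int pages, double overheadNsPerCall, double nsPerByte) {
    double overhead = pages * overheadNsPerCall;
    double work = totalBytes * nsPerByte;
    return overhead / (overhead + work);
  }

  public static void main(String[] args) {
    long totalBytes = 8L * 1024 * 1024; // same data volume in both cases
    double overheadNsPerCall = 20_000;  // invented fixed cost per codec invocation
    double nsPerByte = 1.0;             // invented compression cost per byte

    // Fewer, larger pages versus more, smaller pages.
    System.out.println("8 pages:   " + overheadShare(totalBytes, 8, overheadNsPerCall, nsPerByte));
    System.out.println("512 pages: " + overheadShare(totalBytes, 512, overheadNsPerCall, nsPerByte));
  }
}
```

Under this simple model, removing the fixed per-call cost helps most when pages are small and numerous, which is consistent with the larger V2 speedups reported above.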
   



