iemejia opened a new pull request, #3555:
URL: https://github.com/apache/parquet-java/pull/3555

   ## Summary
   
   Bypass the Hadoop `Compressor`/`Decompressor`/`CodecPool` abstraction layer 
in `CodecFactory` and `DirectCodecFactory`, calling native compression 
libraries directly. This eliminates per-page stream creation, intermediate 
buffer copies, and codec pool synchronization for all four supported codecs.
   
   ### What changes
   
   - **Snappy**: Replace `CodecPool` + `SnappyCompressor` (which copies 
heap→direct→heap) with a single `Snappy.compress(byte[], byte[])` / 
`Snappy.uncompress(byte[], byte[])` JNI call and a reusable output buffer.
   - **LZ4_RAW**: Replace `NonBlockedCompressor` (which allocates direct 
ByteBuffers and copies heap↔direct twice per call) with heap buffers from 
`ByteBuffer.wrap()` passed straight to the airlift LZ4 compressor/decompressor, 
with zero intermediate copies.
   - **ZSTD**: Replace `ZstdCompressorStream` with 
`ZstdOutputStreamNoFinalizer` (avoids finalizer registration) and cache the 
ZSTD level / buffer pool configuration reads per compressor instance instead of 
re-reading `Configuration` on each page.
   - **GZIP**: Replace Hadoop's `GzipCodec` (which wraps Java's 
`Deflater`/`Inflater` in stream abstractions) with direct `Deflater`/`Inflater` 
usage, reusing instances via `reset()` and managing GZIP headers/trailers 
manually.
   - **Benchmark**: Update `CompressionBenchmark` page sizes from `{8KB, 64KB, 
256KB}` to `{64KB, 128KB, 256KB, 1MB}` to reflect real-world Parquet page sizes 
(most pages are 64-256KB due to the 20K row-count limit from PARQUET-1414; only 
wide string/binary columns hit the 1MB size limit).
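
   To illustrate the GZIP item above, here is a minimal JDK-only sketch of what "direct `Deflater` usage with manual headers/trailers" can look like. It is not the PR's `GzipBytesCompressor`: the class name `GzipFramingSketch`, the 8 KB chunk size, and the use of `ByteArrayOutputStream` (chosen for brevity, where the PR describes reusable buffers) are all assumptions made for illustration. The framing follows RFC 1952: a fixed 10-byte header, a raw deflate stream, then CRC32 and input size in little-endian order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

/** Hypothetical illustration only; names and buffer strategy are assumptions. */
final class GzipFramingSketch {
  // Fixed GZIP member header: magic (1f 8b), CM=8 (deflate), FLG=0,
  // MTIME=0 (4 bytes), XFL=0, OS=255 (unknown).
  private static final byte[] HEADER = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff};

  // nowrap=true produces a raw deflate stream without the zlib wrapper.
  private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
  private final CRC32 crc = new CRC32();
  private final byte[] chunk = new byte[8192];

  /** Compresses one page; the Deflater and CRC32 are reused via reset(). */
  byte[] compress(byte[] input) {
    deflater.reset();
    crc.reset();
    crc.update(input, 0, input.length);

    ByteArrayOutputStream out = new ByteArrayOutputStream(input.length / 2 + 32);
    out.write(HEADER, 0, HEADER.length);

    deflater.setInput(input);
    deflater.finish();
    while (!deflater.finished()) {
      int n = deflater.deflate(chunk);
      out.write(chunk, 0, n);
    }

    // Trailer: CRC32 of the uncompressed data, then its length mod 2^32.
    writeIntLE(out, (int) crc.getValue());
    writeIntLE(out, input.length);
    return out.toByteArray();
  }

  /** Releases the native zlib state held by the Deflater. */
  void close() {
    deflater.end();
  }

  private static void writeIntLE(ByteArrayOutputStream out, int v) {
    out.write(v & 0xff);
    out.write((v >>> 8) & 0xff);
    out.write((v >>> 16) & 0xff);
    out.write((v >>> 24) & 0xff);
  }

  /** Decodes with the stock JDK reader, which also validates the trailer. */
  static byte[] gunzip(byte[] gz) {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

   Decoding the output with `GZIPInputStream` is a convenient cross-check, because the JDK reader rejects a member whose CRC32 or size trailer does not match the inflated data.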
   
   ### Benchmark results (ops/s, higher is better)
   
   #### Compression
   
   | Codec | Page Size | Master | Branch | Delta |
   |-------|-----------|-------:|-------:|------:|
   | SNAPPY | 64 KB | 53,979 | 60,799 | **+12.6%** |
   | SNAPPY | 128 KB | 27,764 | 30,524 | **+9.9%** |
   | SNAPPY | 256 KB | 13,549 | 14,648 | **+8.1%** |
   | SNAPPY | 1 MB | 2,445 | 2,675 | **+9.4%** |
   | LZ4_RAW | 1 MB | 1,961 | 2,191 | **+11.7%** |
   | LZ4_RAW | 64-256 KB | — | — | within noise (-1 to -4%) |
   | ZSTD | all sizes | — | — | within noise |
   | GZIP | all sizes | — | — | within noise |
   
   #### Decompression
   
   | Codec | Page Size | Master | Branch | Delta |
   |-------|-----------|-------:|-------:|------:|
   | LZ4_RAW | 64 KB | 80,415 | 118,358 | **+47.2%** |
   | LZ4_RAW | 128 KB | 40,615 | 59,620 | **+46.8%** |
   | LZ4_RAW | 256 KB | 19,888 | 29,914 | **+50.4%** |
   | LZ4_RAW | 1 MB | 4,628 | 7,517 | **+62.4%** |
   | SNAPPY | 64 KB | 60,928 | 67,224 | **+10.3%** |
   | SNAPPY | 128 KB | 29,919 | 33,457 | **+11.8%** |
   | SNAPPY | 256 KB | 14,431 | 15,912 | **+10.3%** |
   | SNAPPY | 1 MB | 3,140 | 3,540 | **+12.7%** |
   | ZSTD | 64 KB | 32,042 | 35,750 | **+11.6%** |
   | ZSTD | 128 KB | 19,447 | 21,800 | **+12.1%** |
   | ZSTD | 256 KB | 9,495 | 10,759 | **+13.3%** |
   | ZSTD | 1 MB | 2,155 | 2,409 | **+11.8%** |
   | GZIP | 128 KB | 4,101 | 4,536 | **+10.6%** |
   | GZIP | 256 KB | 1,736 | 1,891 | **+8.9%** |
   | GZIP | 1 MB | 406 | 442 | **+9.1%** |
   
   JMH config: JDK 25.0.3 Temurin, 1 fork, 2 warmup iterations × 1 s, 3 measurement iterations × 2 s.
   
   ### Why LZ4_RAW decompression gains are largest
   
   `NonBlockedDecompressor` performs two full data copies per operation — heap 
byte[] → direct ByteBuffer on input, direct ByteBuffer → heap byte[] on output 
— plus direct buffer allocation and synchronized access. The bypass eliminates 
both copies by using `ByteBuffer.wrap()` on heap arrays, letting airlift's LZ4 
decompress directly between heap buffers.
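
   The difference between the two data paths can be sketched with the JDK alone. In the example below, `fakeDecompress` is a stand-in that merely copies bytes between two `ByteBuffer`s; it is not airlift's LZ4 API, and the class and method names are hypothetical. The point is only the shape of the surrounding code: the staged path needs two bulk copies and two direct allocations around the codec call, while the wrapped path hands the caller's arrays to the codec as-is.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of staged vs. wrapped buffer handling. */
final class HeapWrapSketch {
  /** Stand-in for a ByteBuffer-to-ByteBuffer codec call; it only copies bytes. */
  static void fakeDecompress(ByteBuffer src, ByteBuffer dst) {
    dst.put(src);
  }

  /** Staged path: heap -> direct, codec call, direct -> heap. */
  static int viaDirectBuffers(byte[] in, byte[] out) {
    ByteBuffer directIn = ByteBuffer.allocateDirect(in.length);
    ByteBuffer directOut = ByteBuffer.allocateDirect(out.length);
    directIn.put(in);               // copy 1: heap array into direct buffer
    directIn.flip();
    fakeDecompress(directIn, directOut);
    directOut.flip();
    int n = directOut.remaining();
    directOut.get(out, 0, n);       // copy 2: direct buffer back into heap array
    return n;
  }

  /** Wrapped path: the codec reads and writes the caller's arrays directly. */
  static int viaHeapWrap(byte[] in, byte[] out) {
    ByteBuffer src = ByteBuffer.wrap(in);   // no copy; buffer is a view of `in`
    ByteBuffer dst = ByteBuffer.wrap(out);  // no copy; writes land in `out`
    fakeDecompress(src, dst);
    return dst.position();
  }
}
```

   `ByteBuffer.wrap()` returns a view backed by the given array, so `ByteBuffer.wrap(in).array() == in` holds and no staging memory is allocated.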
   
   ### Why ZSTD compression gains are minimal
   
   `ZstandardCodec` already returns `null` from 
`createCompressor()`/`createDecompressor()` and delegates directly to 
`zstd-jni` streams. The Hadoop abstraction overhead was already bypassed at the 
codec level. The branch adds finalizer avoidance (`NoFinalizer` variants) and 
caches configuration reads, which helps decompression but leaves compression 
within noise.
   
   ### Alternative considered: modify codecs instead of CodecFactory
   
   We evaluated modifying `SnappyCodec` and `Lz4RawCodec` to follow the 
`ZstandardCodec` pattern (return `null` from `createCompressor()`, use custom 
stream wrappers). This approach was **25-50% slower** than the `CodecFactory` 
bypass for Snappy/LZ4 and even **20-47% slower than master**. The per-call 
stream creation, `ByteArrayOutputStream` buffering, and lack of buffer reuse 
dominate for memory-bandwidth-bound codecs where the actual compression takes 
only 8-65 microseconds.
   
   ### Files changed
   
   - `CodecFactory.java`: Bypass compressor/decompressor with codec-specific 
inner classes (`SnappyBytesCompressor`, `Lz4RawBytesCompressor`, 
`ZstdBytesCompressor`, `GzipBytesCompressor` + matching decompressors)
   - `DirectCodecFactory.java`: Bypass for direct `ByteBuffer` path (Snappy, 
LZ4_RAW, ZSTD)
   - `BytesInput.java`: Add `ByteBufferBackedOutputStream` to avoid 
`toByteArray()` copies
   - `CompressionBenchmark.java`: Realistic page sizes + JMH annotation 
processor fix for Java 17+
   - `TestDirectCodecFactory.java`: Updated tests for bypass path
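
   For readers unfamiliar with the pattern behind `ByteBufferBackedOutputStream`, the following is one plausible shape for such an adapter, written without reference to the PR's actual implementation. The class name `ByteBufferSinkSketch` and its fixed-capacity behaviour are assumptions. It lets stream-oriented code write straight into a caller-supplied `ByteBuffer` instead of filling a `ByteArrayOutputStream` and copying the result out with `toByteArray()`.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical sketch; not the class added to BytesInput.java. */
final class ByteBufferSinkSketch extends OutputStream {
  private final ByteBuffer target;

  ByteBufferSinkSketch(ByteBuffer target) {
    this.target = target;
  }

  @Override
  public void write(int b) {
    // Throws BufferOverflowException if the buffer is full.
    target.put((byte) b);
  }

  @Override
  public void write(byte[] src, int off, int len) {
    // Bulk put avoids the per-byte path of the default OutputStream.write.
    target.put(src, off, len);
  }
}
```

   After writing, the caller can `flip()` the buffer and read the bytes in place; this sketch does not grow the buffer, so the caller must size it up front.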


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

