[PR] [SPARK-56895][SQL] Batch ByteBuffer slice in RLE PACKED decode to reduce allocation overhead [spark]

via GitHub Sat, 16 May 2026 16:18:49 -0700


iemejia opened a new pull request, #55922:
URL: https://github.com/apache/spark/pull/55922


   ### What changes were proposed in this pull request?
   
   In `VectorizedRleValuesReader.readNextGroup()`, the PACKED branch previously 
called `in.slice(bitWidth)` once per group of 8 values, allocating a new 
`ByteBuffer` view per call. This PR batches all packed bytes into a single 
`in.slice(numGroups * bitWidth)` call, then indexes into the resulting buffer 
with an advancing position offset.
   
   The change is 4 lines of net diff in `readNextGroup()`:
   
   ```java
   // Before: one slice() per group — N ByteBuffer allocations
   while (valueIndex < this.currentCount) {
       ByteBuffer buffer = in.slice(bitWidth);
       this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, 
valueIndex);
       valueIndex += 8;
   }
   
   // After: single slice() for all groups — 1 ByteBuffer allocation
   int totalBytes = numGroups * bitWidth;
   ByteBuffer packed = in.slice(totalBytes);
   int pos = packed.position();
   while (valueIndex < this.currentCount) {
       this.packer.unpack8Values(packed, pos, this.currentBuffer, valueIndex);
       pos += bitWidth;
       valueIndex += 8;
   }
   ```
   
   This works because `BytePacker.unpack8Values(ByteBuffer, int inPos, int[], 
int)` uses absolute positioning (`inPos` is the absolute offset into the 
buffer, not relative to `position()`).
   
   ### Why are the changes needed?
   
   `ByteBufferInputStream.slice()` calls `buffer.duplicate()` which allocates a 
new `ByteBuffer` object each time. For a 1M-value PACKED page at bitWidth=4, 
that is 128K `ByteBuffer` allocations per page. These short-lived objects 
create GC pressure and prevent `BytePacker.unpack8Values()` from being inlined 
(the `ByteBuffer` argument escapes through the abstract method call, defeating 
scalar replacement).
   
   With a single slice, the entire PACKED block shares one `ByteBuffer` view. 
The `unpack8Values` calls index into it via an advancing absolute position, 
which C2 can optimize far more aggressively.
   
   Benchmark results on the same machine (AMD EPYC 9V45, JDK 17), before vs 
after:
   
   **readIntegers (dictionary-id decode, PACKED):**
   
   | bitWidth | Before (M/s) | After (M/s) | Speedup |
   |----------|-------------|-------------|---------|
   | 4        | 886         | 1862        | 2.1x    |
   | 8        | 885         | 2088        | 2.4x    |
   | 12       | 609         | 953         | 1.6x    |
   | 20       | 518         | 709         | 1.4x    |
   
   **Single-value reads (readInteger, PACKED):**
   
   | bitWidth | Before (M/s) | After (M/s) | Speedup |
   |----------|-------------|-------------|---------|
   | 4        | 490         | 690         | 1.4x    |
   | 8        | 495         | 722         | 1.5x    |
   | 12       | 392         | 510         | 1.3x    |
   | 20       | 354         | 431         | 1.2x    |
   
   **skipIntegers (PACKED):**
   
   | bitWidth | Before (M/s) | After (M/s) | Speedup |
   |----------|-------------|-------------|---------|
   | 4        | 916         | 1942        | 2.1x    |
   | 8        | 915         | 2238        | 2.4x    |
   | 12       | 622         | 988         | 1.6x    |
   | 20       | 528         | 722         | 1.4x    |
   
   RLE paths are unaffected (no PACKED decode involved). The benefit is largest 
at smaller bitWidths where more groups (and therefore more slice calls) are 
needed per page.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is an internal performance optimization in the Parquet vectorized 
reader. No API, configuration, or behavioral changes.
   
   ### How was this patch tested?
   
   - Existing `VectorizedRleValuesReaderSuite` (14 tests covering PACKED mode, 
RLE mode, mixed, row-index filtering, multi-page, cross-batch continuity) — all 
pass.
   - Existing `ParquetQuerySuite`, `ParquetIOSuite`, `ParquetEncodingSuite`, 
`ParquetSchemaSuite` — all 244 tests pass on JDK 17.
   - `VectorizedRleValuesReaderBenchmark` — before/after comparison on the same 
machine shown above.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode (Claude claude-opus-4.6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-56895][SQL] Batch ByteBuffer slice in RLE PACKED decode to reduce allocation overhead [spark]

Reply via email to