iemejia opened a new pull request, #55924: URL: https://github.com/apache/spark/pull/55924
### What changes were proposed in this pull request? Reduces per-value heap allocations in `VectorizedDeltaByteArrayReader` (the Parquet DELTA_BYTE_ARRAY vectorized decoder) by replacing `ByteBuffer`-based state tracking with a reusable `byte[]` buffer. Key changes: 1. **Replace `ByteBuffer previous` with `byte[] prevBuf` + `int prevLen`**: The DELTA_BYTE_ARRAY encoding exploits prefix sharing between consecutive values. The old code materialized each decoded value as a `ByteBuffer` obtained from `arrayData.getByteBuffer()` (which allocates ~48 bytes per value via `ByteBuffer.wrap()`). The new approach uses a reusable `byte[]` buffer where the prefix bytes from the previous iteration are already in place at the start -- only the suffix portion needs to be written. 2. **Add `getSuffixInto()` to `VectorizedDeltaLengthByteArrayReader`**: New method that reads suffix bytes directly into a caller-supplied `byte[]` via `InputStream.read()`, avoiding the `ByteBuffer` allocation that `getBytes()` / `in.slice()` performs. Also adds `getSuffixLength()` for cases where only the length is needed. 3. **Rewrite `skipBinary()` to use `prevBuf` directly**: The old implementation was particularly wasteful -- it maintained two `WritableColumnVector` instances (`binaryValVector` and `tempBinaryValVector`), reset one per skipped value, and swapped them after each iteration. The new implementation simply assembles into `prevBuf` without touching any column vectors. 4. **Remove `tempBinaryValVector` field**: No longer needed after the `skipBinary` rewrite. ### Why are the changes needed? The DELTA_BYTE_ARRAY decoder allocated 2 `ByteBuffer` objects per decoded value (one from `in.slice()` for the suffix, one from `arrayData.getByteBuffer()` for the previous-value reference). For a typical 4096-value page, this produced ~8K short-lived objects (~384KB) that stress the young-generation GC. The `skipBinary` path was even worse: it reset and rebuilt column vectors per skipped value, involving `reserve()`, `Arrays.copyOf()`, and `putArray()` overhead just to maintain the previous-value state. After this change, the hot path performs zero per-value heap allocations (the `prevBuf` array is reused across all values and only grows if a value exceeds its current capacity). Normalized benchmark improvements (using Variant reads as cross-hardware baseline): - `readBinary`: 1.1-1.3x faster across overlap shapes - `skipBinary`: 1.5-1.9x faster (largest gains from eliminating column vector reset/swap) - `readBinary(len)` single-value: ~1.2x faster ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Existing tests: `ParquetDeltaByteArrayEncodingSuite` (14 tests), `ParquetDeltaLengthByteArrayEncodingSuite`, `ParquetEncodingSuite` -- all pass. - Benchmark: `VectorizedDeltaReaderBenchmark` Group C (DELTA_BYTE_ARRAY) with four overlap shapes (no overlap, half overlap, full overlap, len=64). ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenCode with Claude Opus (claude-opus-4.6) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
