iemejia opened a new pull request, #55932: URL: https://github.com/apache/spark/pull/55932
### What changes were proposed in this pull request?

This PR reduces object allocation in the DELTA_LENGTH_BYTE_ARRAY vectorized Parquet reader (`VectorizedDeltaLengthByteArrayReader`) by applying three targeted changes:

- **readBinary**: Replace per-value `in.slice(length)` (one ByteBuffer allocation per value) with a single bulk `in.slice(totalDataLen)` that reads the entire batch at once. Individual values are then written to the column vector via `putByteArray` from the shared backing array, eliminating N-1 ByteBuffer object allocations.
- **skipBinary**: Replace the per-value skip loop (N separate `in.skip()` calls) with a single bulk skip by summing all value lengths upfront.
- **readGeoData**: Remove the `ByteBuffer.wrap()` + `ByteBufferOutputWriter` indirection per value and call `putByteArray` directly from the converter output array.

### Why are the changes needed?

The DELTA_LENGTH_BYTE_ARRAY encoding is used for binary/string columns in Parquet v2 pages. In the current vectorized reader, `readBinary` allocates one `ByteBuffer` per value via `in.slice(length)`, and `skipBinary` performs a separate stream skip per value. For large batches (e.g. 1M values per page), this creates significant allocation pressure and per-call overhead.

Micro-benchmarks on `VectorizedDeltaReaderBenchmark` Group D show:

| Benchmark | Before (ms) | After (ms) | Speedup |
|---|---|---|---|
| readBinary, payloadLen=8 | 12 | 10 | **1.2x** |
| readBinary, payloadLen=32 | 16 | 14 | **1.1x** |
| readBinary, payloadLen=128 | 13 | 12 | **1.1x** |
| readBinary, payloadLen=512 | 32 | 32 | ~1.0x |
| skipBinary (all sizes) | 7 | 5 | **1.4x** |

`readBinary` speedup is larger for small payloads where allocation cost dominates. `skipBinary` shows consistent 1.4x improvement across all payload sizes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests: `ParquetDeltaLengthByteArrayEncodingSuite` (14 tests including serialization, random strings, empty strings, skip interleaving, and geo types) and `ParquetEncodingSuite` all pass.

Benchmarks: `VectorizedDeltaReaderBenchmark` Group D (DELTA_LENGTH_BYTE_ARRAY) run locally on JDK 17.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenCode with Claude claude-opus-4.6

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
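
To illustrate the bulk `readBinary` and `skipBinary` idea described in this PR, here is a minimal, self-contained sketch. It is not the code from the patch: it uses a plain `java.nio.ByteBuffer` instead of Parquet's input stream, and `BulkDeltaLengthSketch`, `ValueSink`, `readBulk`, `skipBulk`, and `totalLength` are hypothetical names introduced only for this example. `ValueSink.putByteArray` stands in for the column vector's `putByteArray`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BulkDeltaLengthSketch {

  /** Hypothetical stand-in for a column vector's putByteArray(rowId, src, offset, length). */
  interface ValueSink {
    void putByteArray(int rowId, byte[] src, int offset, int length);
  }

  /** Sum of all value lengths in the batch, failing loudly on int overflow. */
  static int totalLength(int[] lengths) {
    int total = 0;
    for (int len : lengths) {
      total = Math.addExact(total, len);
    }
    return total;
  }

  /** Takes ONE slice covering the whole batch, then hands out (offset, length) ranges. */
  static void readBulk(ByteBuffer in, int[] lengths, ValueSink sink) {
    int total = totalLength(lengths);
    ByteBuffer batch = in.slice();          // the only ByteBuffer allocated for the batch
    batch.limit(total);
    in.position(in.position() + total);     // consume the batch from the input

    byte[] backing;
    int offset;
    if (batch.hasArray()) {
      // Heap buffer: read straight from the shared backing array.
      backing = batch.array();
      offset = batch.arrayOffset() + batch.position();
    } else {
      // Direct or read-only buffer: one bulk copy instead of one per value.
      backing = new byte[total];
      batch.get(backing);
      offset = 0;
    }
    for (int i = 0; i < lengths.length; i++) {
      sink.putByteArray(i, backing, offset, lengths[i]);
      offset += lengths[i];
    }
  }

  /** Skips the whole batch with a single position update instead of N skips. */
  static void skipBulk(ByteBuffer in, int[] lengths) {
    in.position(in.position() + totalLength(lengths));
  }

  public static void main(String[] args) {
    // Four values "ab", "", "cde", "f" stored back to back, followed by two extra bytes.
    byte[] data = "abcdefXY".getBytes(StandardCharsets.UTF_8);
    int[] lengths = {2, 0, 3, 1};

    List<String> values = new ArrayList<>();
    ByteBuffer in = ByteBuffer.wrap(data);
    readBulk(in, lengths,
        (row, src, off, len) -> values.add(new String(src, off, len, StandardCharsets.UTF_8)));
    System.out.println(values + " remaining=" + in.remaining());

    ByteBuffer skipped = ByteBuffer.wrap(data);
    skipBulk(skipped, lengths);
    System.out.println("remaining after skip=" + skipped.remaining());
  }
}
```

The sketch only shows where the per-value allocations and skips go away; it makes no claim about the benchmark numbers above, which depend on the real reader and column vector implementations.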
