[PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

via GitHub Tue, 16 Jun 2026 02:19:25 -0700


iemejia opened a new pull request, #56543:
URL: https://github.com/apache/spark/pull/56543


   ### What changes were proposed in this pull request?
   
   Re-apply the bulk read optimization for `VectorizedDeltaBinaryPackedReader` 
(reverted in c13302acc2a) with a fix for the INT32 widening bug that caused the 
CI failure.
   
   **Commit 1** — Reapply the original optimization (revert of the revert):
   - Bulk `readIntegers`/`readLongs` via prefix-sum + `putInts`/`putLongs`
   - Zero-allocation unsigned long encoding (`encodeUnsignedLongBigEndian`)
   - `readIntegersAsLongs` and `readIntegersAsDoubles` overrides
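
For readers unfamiliar with the encoding, the prefix-sum idea behind the bulk path can be sketched as follows. This is a minimal illustration, not the code in `VectorizedDeltaBinaryPackedReader`: the class `PrefixSumSketch`, the method `decodeBlock`, and its parameters are hypothetical names, and the sketch assumes the bit-unpacking step has already produced an array of non-negative deltas relative to the block's minimum delta. A real reader would presumably hand the filled array to the column vector in one call (the `putInts`/`putLongs` mentioned above) instead of returning it.

```java
import java.util.Arrays;

class PrefixSumSketch {
    // Hypothetical helper: rebuilds values from a first value, the block's
    // minimum delta, and the unpacked (delta - minDelta) offsets.
    // The running sum uses int arithmetic, so it wraps the same way an
    // int-based encoder would.
    static int[] decodeBlock(int firstValue, int minDelta, int[] unpackedDeltas) {
        int[] out = new int[unpackedDeltas.length + 1];
        out[0] = firstValue;
        int running = firstValue;
        for (int i = 0; i < unpackedDeltas.length; i++) {
            running += minDelta + unpackedDeltas[i]; // prefix sum over deltas
            out[i + 1] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        // Original values 7, 7, 12, 15, 13 have deltas 0, 5, 3, -2;
        // with minDelta = -2 the stored offsets are 2, 7, 5, 0.
        int[] values = decodeBlock(7, -2, new int[] {2, 7, 5, 0});
        System.out.println(Arrays.toString(values)); // [7, 7, 12, 15, 13]
    }
}
```

Filling a plain array in a tight loop and copying it out once is what lets the per-value callback disappear from the hot path.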
   
   **Commit 2** — Fix the INT32 widening bug:
   - The Parquet INT32 delta encoder 
(`DeltaBinaryPackingValuesWriterForInteger`) computes deltas in Java int 
arithmetic, which wraps around on overflow. The bulk widened readers 
(`readIntegersAsLongs`, `readIntegersAsDoubles`) performed the prefix sum 
in long space and wrote the raw long results without truncating back to int. 
When a delta overflows (e.g. in a sequence containing `Int.MinValue`), the 
reconstructed long has the wrong sign.
   - Fix: truncate each prefix-sum result to int before widening to long/double
   - Add focused low-level tests for the overflow case (single-batch and split 
reads)
   - Add benchmark cases for the overflow pattern
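
The overflow case can be reproduced in isolation with a few lines of Java. The sketch below is an illustration under stated assumptions, not the patched reader: `DeltaWideningSketch` and its three methods are hypothetical names, and the encoder is modeled simply as `next - previous` in int arithmetic. It uses the pair `Integer.MAX_VALUE`, `Integer.MIN_VALUE`, whose int delta wraps around to 1.

```java
class DeltaWideningSketch {
    // Assumed encoder behavior: the delta is computed in int arithmetic
    // and therefore wraps modulo 2^32.
    static int intDelta(int previous, int next) {
        return next - previous;
    }

    // Widening without truncation: the sum is kept in long space, so the
    // wrap-around the encoder relied on never happens.
    static long widenWithoutTruncation(int previous, int delta) {
        return (long) previous + delta;
    }

    // Widening with truncation: narrow the sum to int first (restoring
    // the wrap-around), then widen to long.
    static long widenWithTruncation(int previous, int delta) {
        return (long) (int) ((long) previous + delta);
    }

    public static void main(String[] args) {
        int previous = Integer.MAX_VALUE;
        int next = Integer.MIN_VALUE;
        int delta = intDelta(previous, next); // wraps to 1

        System.out.println(delta);                                   // 1
        System.out.println(widenWithoutTruncation(previous, delta)); // 2147483648
        System.out.println(widenWithTruncation(previous, delta));    // -2147483648
    }
}
```

The same narrowing step applies before widening to double, since the int value is what the encoder actually stored.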
   
   This is the same content as #55919, which was merged and reverted due to 
this bug.
   
   ### Why are the changes needed?
   
   The bulk read path eliminates per-value lambda dispatch overhead and enables 
the JIT to better vectorize the inner unpacking loop. See #55919 for full 
benchmark results.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   - `ParquetTypeWideningSuite`: IntegerType -> LongType, IntegerType -> 
DoubleType
   - `ParquetDeltaEncodingInteger`: new focused tests for modular delta overflow
   - `ParquetDeltaEncodingInteger`/`ParquetDeltaEncodingLong`: full suites (30 tests)
   - `ParquetIOSuite`: UINT_64 tests
   - `VectorizedDeltaReaderBenchmark`: full suite including new overflow cases
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.
   
   Assisted-by: GitHub Copilot:claude-opus-4.6
   
   cc @LuciferYang @sunchao


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
