Re: [PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

via GitHub Fri, 12 Jun 2026 07:48:06 -0700


iemejia commented on PR #55919:
URL: https://github.com/apache/spark/pull/55919#issuecomment-4692346315


   @LuciferYang Sorry for the extra churn -- I added one more commit with 
`readIntegersAsLongs` and `readIntegersAsDoubles` overrides for the 
DELTA_BINARY_PACKED reader. It seemed worth including since the delta decoder 
already works on `long[]` internally, so these overrides skip the int narrowing 
step entirely and write longs/doubles directly from the prefix-sum buffer.
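
   To illustrate the idea, here is a minimal, self-contained sketch rather than the code from the commit. The class `DeltaWideningSketch` and its methods are hypothetical names, and the sketch assumes every decoded value already fits in the `int` range, so skipping the narrowing cast does not change the result.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual Spark reader implementation.
class DeltaWideningSketch {

    // Rebuild values from a first value plus deltas. The running sum is
    // kept in a long[] buffer, mirroring a decoder that works on longs.
    static long[] prefixSum(long firstValue, long[] deltas) {
        long[] out = new long[deltas.length + 1];
        out[0] = firstValue;
        for (int i = 0; i < deltas.length; i++) {
            out[i + 1] = out[i] + deltas[i];
        }
        return out;
    }

    // Per-row style path: narrow each value to int, then widen it back to long.
    static void perRowAsLongs(long[] decoded, long[] dst) {
        for (int i = 0; i < decoded.length; i++) {
            int narrowed = (int) decoded[i]; // narrowing step
            dst[i] = narrowed;               // implicit widening back to long
        }
    }

    // Bulk path: the buffer already holds longs, so copy it directly.
    // Assumption: all values are within the int range.
    static void bulkAsLongs(long[] decoded, long[] dst) {
        System.arraycopy(decoded, 0, dst, 0, decoded.length);
    }

    // Bulk path for doubles: convert long to double without an int step.
    static void bulkAsDoubles(long[] decoded, double[] dst) {
        for (int i = 0; i < decoded.length; i++) {
            dst[i] = (double) decoded[i];
        }
    }

    public static void main(String[] args) {
        long[] decoded = prefixSum(10L, new long[] {1L, 2L, -3L});
        long[] asLongs = new long[decoded.length];
        double[] asDoubles = new double[decoded.length];
        bulkAsLongs(decoded, asLongs);
        bulkAsDoubles(decoded, asDoubles);
        System.out.println(Arrays.toString(asLongs));
        System.out.println(Arrays.toString(asDoubles));
    }
}
```

   Under the stated range assumption, `perRowAsLongs` and `bulkAsLongs` fill the destination with the same values; the bulk variant just avoids the per-element cast round trip.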
   
   A local benchmark shows a **2.1x** speedup for `readIntegersAsLongs` and **2.0x** for 
`readIntegersAsDoubles` versus the per-row default path.
   
   This benefits `DateToTimestampNTZUpdater`, `IntegerToLongUpdater`, and 
`IntegerToDoubleUpdater` when reading INT32 columns encoded with Parquet V2 
DELTA_BINARY_PACKED -- the improvement carries over automatically via the two-pass 
updater pattern in PR #55923.
   
   The review delta is small (25 lines of new code in the reader + 8 lines of 
benchmark cases) if you want to focus just on the new commit.



