Re: [PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

via GitHub Fri, 12 Jun 2026 00:48:04 -0700


iemejia commented on PR #55919:
URL: https://github.com/apache/spark/pull/55919#issuecomment-4688824001


   @LuciferYang All three benchmark runs (JDK 17, 21, 25) now use the same 
AMD EPYC 7763 hardware, and the results are promising:
   
   **INT64 reads**: 1.8x-3.7x speedup across all JDKs and data patterns
   **INT64 skip**: 2.3x-4.0x
   **Unsigned long encoding** (with the new `byte[]` loop): 7.3x-8.6x
   **INT32 reads**: 1.1x-1.6x (narrowing overhead limits gains)
   **DELTA_BYTE_ARRAY / DELTA_LENGTH_BYTE_ARRAY**: 1.2x-1.9x indirect 
improvement
   
   Updated the PR description with full JDK 17/21/25 comparison tables and the 
new workflow run links.
   
   Thank you for all your help and the thorough review suggestions -- the 
`byte[]` loop approach is cleaner and avoids the ByteBuffer abstraction 
entirely, and moving the scratch buffer allocation to `initFromPage` makes the 
code more straightforward. Really appreciate the guidance on getting the 
benchmark workflow right too.
   
   I believe this is ready to go now -- would you be able to merge it when you 
get a chance?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

