Re: [PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

via GitHub Tue, 23 Jun 2026 03:23:47 -0700


iemejia commented on PR #55921:
URL: https://github.com/apache/spark/pull/55921#issuecomment-4778197470


   @MaxGekk This is ready for another round of review. Here's a summary of the 
changes since your last review:
   
   **Blocking fix (option 2):** Added a dedicated `readFixedLenByteArray(int 
total, int len, WritableColumnVector c, int rowId)` default method to 
`VectorizedValuesReader`, with optimized overrides in 
`VectorizedPlainValuesReader` (reads `len` bytes directly, no length prefix) 
and `VectorizedByteStreamSplitValuesReader` (delegates to the existing correct 
`readBinary` batch method). `FixedLenByteArrayUpdater.readValues` now calls 
this instead of `readBinary(total, c, rowId)`.
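
For readers following along, here is a minimal sketch of the layout difference the fix depends on. It assumes only the Parquet PLAIN layout: FIXED_LEN_BYTE_ARRAY values are stored back to back, while BYTE_ARRAY values each carry a 4-byte little-endian length prefix. `FlbaSketch`, `ValuesReader`, `PlainReader`, and `decodeFlba` are hypothetical stand-ins, not Spark's classes, and a plain `byte[][]` replaces `WritableColumnVector`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class FlbaSketch {

  /** Hypothetical, simplified stand-in for a values reader (not Spark's interface). */
  interface ValuesReader {
    /** Reads exactly {@code len} bytes as one value. */
    byte[] readBinary(int len);

    /** Default path: a per-value loop. Implementations may override it with a bulk read. */
    default void readFixedLenByteArray(int total, int len, byte[][] out, int rowId) {
      for (int i = 0; i < total; i++) {
        out[rowId + i] = readBinary(len);
      }
    }
  }

  /** Reader over a PLAIN-encoded page body. */
  static final class PlainReader implements ValuesReader {
    private final ByteBuffer in;

    PlainReader(byte[] page) {
      this.in = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** FLBA path: the width comes from the schema, so no prefix is consumed. */
    @Override
    public byte[] readBinary(int len) {
      byte[] value = new byte[len];
      in.get(value);
      return value;
    }

    /** BYTE_ARRAY path, shown for contrast: a 4-byte length precedes each value. */
    byte[] readLengthPrefixed() {
      int len = in.getInt();
      return readBinary(len);
    }
  }

  /** Decodes {@code total} fixed-width values of {@code len} bytes each. */
  static byte[][] decodeFlba(byte[] page, int total, int len) {
    byte[][] out = new byte[total][];
    new PlainReader(page).readFixedLenByteArray(total, len, out, 0);
    return out;
  }

  public static void main(String[] args) {
    byte[] page = {1, 2, 3, 4, 5, 6, 7, 8};
    byte[][] values = decodeFlba(page, 2, 4);
    System.out.println(Arrays.toString(values[1])); // prints [5, 6, 7, 8]
  }
}
```

In this sketch, calling `readLengthPrefixed()` on the same page would interpret the first four data bytes as a length, which is the kind of misread a dedicated fixed-length entry point avoids.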
   
   **End-to-end test:** Added a PLAIN-encoded FLBA regression test in 
`ParquetEncodingSuite` that writes FLBA columns (widths 4 and 12, plus a 
nullable width-8 column) with `dictionaryEncoding=false`, verifies PLAIN 
encoding in the footer metadata, and asserts that the vectorized reader 
round-trips the values correctly.
   
   **Performance:** The `ParquetVectorUpdaterBenchmark` on AMD EPYC 7763 
confirms the fix is 1.4-1.6x faster than the pre-PR per-value loop for 
`FixedLenByteArrayUpdater` (18-21 ms -> 13 ms), with no regressions in any 
other updater group. See the updated PR description for the full breakdown.
   
   **PR description:** Updated to document the `FixedLenByteArrayUpdater` scope 
change as you suggested.
   
   **Rebased** on the latest master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

