Re: [PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

via GitHub Tue, 23 Jun 2026 01:52:23 -0700


iemejia commented on code in PR #55921:
URL: https://github.com/apache/spark/pull/55921#discussion_r3458328888



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1389,9 +1389,7 @@ public void readValues(
         int offset,
         WritableColumnVector values,
         VectorizedValuesReader valuesReader) {
-      for (int i = 0; i < total; i++) {
-        readValue(offset + i, values, valuesReader);
-      }
+      valuesReader.readBinary(total, values, offset);

Review Comment:
   Good catch. I went with option 2 (keeps the FLBA batch speedup):
   
   Added `readFixedLenByteArray(int total, int len, WritableColumnVector c, int 
rowId)` as a `default` method on `VectorizedValuesReader`. The default falls 
back to a per-value `readBinary(len)` loop (correct for any reader), and both 
`VectorizedPlainValuesReader` and `VectorizedByteStreamSplitValuesReader` 
override it with optimized batch implementations:
   
   - **PlainValuesReader**: reads exactly `len` bytes per value from the buffer 
(no length prefix) -- mirrors `skipFixedLenByteArray` semantics.
   - **BSS reader**: delegates to the existing `readBinary(total, c, rowId)` 
which already decodes fixed-width values correctly from the de-interleaved page 
layout.
   
   `FixedLenByteArrayUpdater.readValues` now calls 
`readFixedLenByteArray(total, arrayLen, values, offset)` instead of 
`readBinary(total, values, offset)`.
   
   Added an end-to-end test that writes FLBA columns (widths 4, 12, and 8 
nullable) with `dictionaryEncoding=false` through `ExampleParquetWriter`, 
verifies PLAIN encoding in the footer metadata, and asserts the vectorized 
reader round-trip is correct.
   
   The `ParquetVectorUpdaterBenchmark` confirms this is also faster than the 
pre-PR per-value loop path: the new `readFixedLenByteArray` in 
`VectorizedPlainValuesReader` avoids the intermediate `Binary` allocation and 
`getBytesUnsafe()` copy per value. Will regenerate the benchmark result files 
on the CI machine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

Reply via email to