Re: [PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

via GitHub Tue, 23 Jun 2026 01:13:26 -0700


MaxGekk commented on code in PR #55921:
URL: https://github.com/apache/spark/pull/55921#discussion_r3458083526



##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1389,9 +1389,7 @@ public void readValues(
         int offset,
         WritableColumnVector values,
         VectorizedValuesReader valuesReader) {
-      for (int i = 0; i < total; i++) {
-        readValue(offset + i, values, valuesReader);
-      }
+      valuesReader.readBinary(total, values, offset);

Review Comment:
   **[blocking]** This routes *all* FLBA batch reads through 
`VectorizedValuesReader.readBinary(int total, ...)`, but that method has a 
length-prefixed BYTE_ARRAY contract for every non-BSS reader. 
`VectorizedPlainValuesReader.readBinary(int total, ...)` reads a 4-byte length 
prefix per value (`int len = readInteger(); getBuffer(len)`), whereas 
FIXED_LEN_BYTE_ARRAY data has **no** length prefix — each value is exactly 
`arrayLen` raw bytes. Only the new BSS reader's `readBinary(total,...)` is 
fixed-width.
   
   Since this updater is encoding-agnostic, a PLAIN-encoded FLBA column (a 
`BinaryType` column, or a decimal with precision > 18 — both selected here via 
`canReadAsBinaryDecimal` / `BinaryType`) now consumes the first 4 bytes of each 
value as a bogus length prefix → corrupted bytes or an exception. Pre-PR, the 
per-value loop called single-value `readBinary(arrayLen)` (no prefix) and was 
correct.
   
   Why CI is green: Parquet dictionary-encodes by default, and dictionary pages 
bypass `updater.readValues` entirely (`VectorizedColumnReader.readBatch` → 
`decodeDictionaryIds`). The PLAIN path is only reached on dictionary fallback, 
disabled dictionary, or high cardinality — none of which the tests here 
exercise (the unit suite drives the reader directly, never through 
`VectorizedColumnReader`/updaters; the `ParquetEncodingSuite` FLBA test uses 
BSS encoding; `ParquetVectorizedSuite` covers only BYTE_ARRAY).
   
   Fix options (tradeoff for you to pick):
   - **Minimal / safest:** revert this line to the per-value loop. BSS FLBA 
still works (the BSS reader's single-value `readBinary(len)` is correct); only 
the FLBA *batch* speedup is lost.
   - **Keeps the FLBA batch speedup:** add a dedicated width-carrying batch 
method (e.g. `readFixedLenByteArray(int total, int len, WritableColumnVector, 
int rowId)`) implemented correctly by both `VectorizedPlainValuesReader` 
(fixed-width) and the new BSS reader, and call that here instead of the 
length-prefixed `readBinary`.
   
   Either way, please add an end-to-end test: write an FLBA column (BinaryType 
and/or decimal precision > 18) with `parquet.enable.dictionary=false` and 
enough distinct/large values to force a PLAIN data page, read it through the 
vectorized reader, and assert the round-trip.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

Reply via email to