MaxGekk commented on code in PR #55921:
URL: https://github.com/apache/spark/pull/55921#discussion_r3458083526
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -1389,9 +1389,7 @@ public void readValues(
int offset,
WritableColumnVector values,
VectorizedValuesReader valuesReader) {
- for (int i = 0; i < total; i++) {
- readValue(offset + i, values, valuesReader);
- }
+ valuesReader.readBinary(total, values, offset);
Review Comment:
**[blocking]** This routes *all* FLBA batch reads through
`VectorizedValuesReader.readBinary(int total, ...)`, but that method has a
length-prefixed BYTE_ARRAY contract for every non-BSS reader.
`VectorizedPlainValuesReader.readBinary(int total, ...)` reads a 4-byte length
prefix per value (`int len = readInteger(); getBuffer(len)`), whereas
FIXED_LEN_BYTE_ARRAY data has **no** length prefix — each value is exactly
`arrayLen` raw bytes. Only the new BSS reader's `readBinary(total,...)` is
fixed-width.
Since this updater is encoding-agnostic, a PLAIN-encoded FLBA column (a
`BinaryType` column, or a decimal with precision > 18 — both selected here via
`canReadAsBinaryDecimal` / `BinaryType`) now consumes the first 4 bytes of each
value as a bogus length prefix → corrupted bytes or an exception. Pre-PR, the
per-value loop called single-value `readBinary(arrayLen)` (no prefix) and was
correct.
Why CI is green: Parquet dictionary-encodes by default, and dictionary pages
bypass `updater.readValues` entirely (`VectorizedColumnReader.readBatch` →
`decodeDictionaryIds`). The PLAIN path is only reached on dictionary fallback,
disabled dictionary, or high cardinality — none of which the tests here
exercise (the unit suite drives the reader directly, never through
`VectorizedColumnReader`/updaters; the `ParquetEncodingSuite` FLBA test uses
BSS encoding; `ParquetVectorizedSuite` covers only BYTE_ARRAY).
Fix options (tradeoff for you to pick):
- **Minimal / safest:** revert this line to the per-value loop. BSS FLBA
still works (the BSS reader's single-value `readBinary(len)` is correct); only
the FLBA *batch* speedup is lost.
- **Keeps the FLBA batch speedup:** add a dedicated width-carrying batch
method (e.g. `readFixedLenByteArray(int total, int len, WritableColumnVector,
int rowId)`) implemented correctly by both `VectorizedPlainValuesReader`
(fixed-width) and the new BSS reader, and call that here instead of the
length-prefixed `readBinary`.
Either way, please add an end-to-end test: write an FLBA column (BinaryType
and/or decimal precision > 18) with `parquet.enable.dictionary=false` and
enough distinct/large values to force a PLAIN data page, read it through the
vectorized reader, and assert the round-trip.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]