harperjiang opened a new pull request, #16501:
URL: https://github.com/apache/iceberg/pull/16501
The vectorized Arrow reader fails to allocate a vector for any decimal
column that has an `initialDefault` or `writeDefault` on its Iceberg field.
Reads through `VectorizedTableScanIterable` throw:
```
java.lang.IllegalArgumentException: Cannot cast default value to FIXED:
<default>
at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
at
org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
at
org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)
```
(`fixed[N]` for FIXED_LEN_BYTE_ARRAY-backed decimal, with the same shape.)
`VectorizedArrowReader#getPhysicalType` rewrites a decimal Iceberg field to
its underlying physical type (`fixed[N]`) so the right Arrow vector class is
allocated. It does this with
`Types.NestedField.from(logicalType).ofType(type).build()`, which copies the
field's `initialDefault` / `writeDefault` onto the new physical type.
`NestedField`'s constructor then runs `castDefault(literal, type)`, which
calls `DecimalLiteral.to(FixedType)` — that conversion is not defined for
decimal literals and returns `null`, tripping the `Preconditions.checkArgument`
in `castDefault`.
The defaults are semantically tied to the logical (decimal) view of the
column and should not flow to the physical representation — the physical type
is an implementation detail used only to size the Arrow vector. The fix
constructs the physical field with a fresh `Types.NestedField.builder()`,
carrying over only `id`, `name`, `optionality`, and `doc`, and omitting both
defaults.
The bug only surfaces when the column is not dictionary-encoded, because
`allocateDictEncodedVector` does not call `getPhysicalType`. The new test
disables dictionary encoding to make the regression deterministic.
### Testing
Added `TestArrowReader#testDecimalWithDefaultIsReadByVectorizedReader`,
which:
- creates a v3 table with a `DECIMAL(5, 2)` column carrying both
`initialDefault` and `writeDefault`,
- writes a Parquet file using INT32-backed decimal with dictionary encoding
disabled, and
- reads via `VectorizedTableScanIterable` and asserts the raw INT32 values.
Without the fix the test fails at vector allocation; with the fix all rows
are read correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]