harperjiang opened a new pull request, #16501:
URL: https://github.com/apache/iceberg/pull/16501

   The vectorized Arrow reader fails to allocate a vector for any decimal 
column that has an `initialDefault` or `writeDefault` on its Iceberg field. 
Reads through `VectorizedTableScanIterable` throw:
   
   ```
   java.lang.IllegalArgumentException: Cannot cast default value to FIXED: 
<default>
     at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
     at 
org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
     at 
org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)
   ```
   
   (`fixed[N]` for FIXED_LEN_BYTE_ARRAY-backed decimal, with the same shape.)
   
   `VectorizedArrowReader#getPhysicalType` rewrites a decimal Iceberg field to 
its underlying physical type (`fixed[N]`) so the right Arrow vector class is 
allocated. It does this with 
`Types.NestedField.from(logicalType).ofType(type).build()`, which copies the 
field's `initialDefault` / `writeDefault` onto the new physical type. 
   
   `NestedField`'s constructor then runs `castDefault(literal, type)`, which 
calls `DecimalLiteral.to(FixedType)` — that conversion is not defined for 
decimal literals and returns `null`, tripping the `Preconditions.checkArgument` 
in `castDefault`.
   
   The defaults are semantically tied to the logical (decimal) view of the 
column and should not flow to the physical representation — the physical type 
is an implementation detail used only to size the Arrow vector. The fix 
constructs the physical field with a fresh `Types.NestedField.builder()`, 
carrying over only `id`, `name`, `optionality`, and `doc`, and omitting both 
defaults.
   
   The bug only surfaces when the column is not dictionary-encoded, because 
`allocateDictEncodedVector` does not call `getPhysicalType`. The new test 
disables dictionary encoding to make the regression deterministic.
   
   ### Testing
   
   Added `TestArrowReader#testDecimalWithDefaultIsReadByVectorizedReader`, 
which:
   - creates a v3 table with a `DECIMAL(5, 2)` column carrying both 
`initialDefault` and `writeDefault`,
   - writes a Parquet file using INT32-backed decimal with dictionary encoding 
disabled, and
   - reads via `VectorizedTableScanIterable` and asserts the raw INT32 values.
   
   Without the fix the test fails at vector allocation; with the fix all rows 
are read correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to