ayingsf opened a new issue, #14430:
URL: https://github.com/apache/iceberg/issues/14430

   ### Apache Iceberg version
   
   1.6.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ## Issue Summary
   
   - Spark version 3.5.5
   - `iceberg-spark-runtime-3.5_2.12` version 1.6.1
   
   I'm getting an error when reading an Iceberg table whose Parquet files contain timestamp fields backed by the Parquet `TIMESTAMP_MILLIS` type. The error is:
   
   ```
   java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')
   ```
   
   The root cause appears to be that Iceberg expects a `BigIntVector` in its [vectorized Arrow reader](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L297), while the column vector actually created is of type `TimeStampMicroTZVector`.
   
   The column vector (created via [`FieldType`](https://github.com/apache/arrow-java/blob/main/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java#L107)) inherits its Arrow type from Iceberg itself, which via [ArrowSchemaUtil](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/arrow/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java#L103) always maps timestamps to microsecond precision. Hence the column vector is always a `TimeStampMicroTZVector`, as the sketch below illustrates.
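   
   To make the mismatch concrete, here is a minimal Scala sketch (an illustration against unshaded Arrow, not code from Iceberg; the shaded classes in the stack trace behave identically) of a vector created from a microsecond-precision Arrow timestamp type being cast to `BigIntVector`:
   
   ```scala
   import org.apache.arrow.memory.RootAllocator
   import org.apache.arrow.vector.BigIntVector
   import org.apache.arrow.vector.types.TimeUnit
   import org.apache.arrow.vector.types.pojo.{ArrowType, FieldType}
   
   val allocator = new RootAllocator()
   
   // ArrowSchemaUtil always maps Iceberg timestamps to microsecond precision,
   // so the FieldType below yields a TimeStampMicroTZVector.
   val fieldType = FieldType.nullable(new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC"))
   val vector = fieldType.createNewSingleVector("ts", allocator, null)
   
   // The TIMESTAMP_MILLIS branch of VectorizedArrowReader then does the
   // equivalent of this cast, which throws the ClassCastException above:
   val asBigInt = vector.asInstanceOf[BigIntVector]
   ```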
   
   When the underlying Parquet file has the `TIMESTAMP_MICROS` data type, the reader takes [this path](https://github.com/apache/iceberg/blob/apache-iceberg-1.8.1/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L301) instead, which correctly casts to `TimeStampMicroTZVector` (or the NTZ variant, depending on the Iceberg metadata).
   
   The same read path produces no errors when vectorization is turned off via the config below:
   
   ```
   spark.sql.iceberg.vectorization.enabled=false
   ```
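   
   For example, in a spark-shell session (the table name here is hypothetical), the workaround looks like:
   
   ```scala
   // Disable Iceberg's vectorized Parquet reader for the session; the
   // non-vectorized path reads the TIMESTAMP_MILLIS data without error.
   spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")
   spark.table("local.db.ts_table").show()
   ```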
   
   This logic does not appear to have changed in the latest version of Iceberg, so the issue likely still exists there. Why does the vectorized reader expect a long-typed vector (`BigIntVector`) for `TIMESTAMP_MILLIS` columns?
   
   ## Repro
   
   A similar issue reproduced the above `ClassCastException`: https://github.com/apache/iceberg/issues/14046
   
   In general, if we generate a Parquet dataset in Spark with a timestamp field of millisecond precision, add the Parquet files to an Iceberg table, and then read the table via Spark, the above error surfaces; see the sketch below.
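   
   A minimal spark-shell sketch of that repro (the `local` catalog, paths, and table name are assumptions, not taken from the linked issue):
   
   ```scala
   // 1. Write plain Parquet with millisecond-precision timestamps.
   spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
   spark.sql("SELECT current_timestamp() AS ts").write.parquet("/tmp/ms_parquet")
   
   // 2. Create an Iceberg table with a matching schema and register the files
   //    as-is via the add_files procedure, so the TIMESTAMP_MILLIS encoding is kept.
   spark.sql("CREATE TABLE local.db.ts_table (ts timestamp) USING iceberg")
   spark.sql("""
     CALL local.system.add_files(
       table => 'db.ts_table',
       source_table => '`parquet`.`/tmp/ms_parquet`')
   """)
   
   // 3. A vectorized read of the table then hits the ClassCastException.
   spark.table("local.db.ts_table").show()
   ```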
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time



