ayingsf opened a new issue, #14430: URL: https://github.com/apache/iceberg/issues/14430
### Apache Iceberg version

1.6.1

### Query engine

Spark

### Please describe the bug 🐞

## Issue Summary

- Spark version 3.5.5
- `iceberg-spark-runtime-3.5_2.12` version 1.6.1

I'm getting an error when reading an Iceberg table whose Parquet files contain timestamp fields backed by the Parquet `TIMESTAMP_MILLIS` type. The error is:

```
java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')
```

The root cause appears to be that Iceberg expects a `BigIntVector` in its [vectorized Arrow reader](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L297), but the column vector actually created is of type `TimeStampMicroTZVector`. The column vector (created via [`FieldType`](https://github.com/apache/arrow-java/blob/main/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java#L107)) inherits its Arrow type from Iceberg itself, which [ArrowSchemaUtil](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.1/arrow/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java#L103) always defines with microsecond precision; hence the column vector is always a `TimeStampMicroTZVector`. When the underlying Parquet file uses the `TIMESTAMP_MICROS` type, the reader takes [this path](https://github.com/apache/iceberg/blob/apache-iceberg-1.8.1/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java#L301) instead, which correctly casts to `TimeStampMicroTZVector` (or the NTZ variant, depending on the Iceberg metadata).

The same read path produces no errors when vectorization is turned off via the config below (a session-level toggle is sketched at the end of this issue):

```
spark.sql.iceberg.vectorization.enabled=false
```

I don't see this logic changing in the latest version of Iceberg, so the issue likely still exists there. Why does the vectorized reader expect a long-typed vector here?

## Repro

A similar issue reproduced the same ClassCastException: https://github.com/apache/iceberg/issues/14046

In general, if we generate a Parquet dataset in Spark with a timestamp field of `ms` precision, add the Parquet files to an Iceberg table, and then read the table back via Spark, the error above surfaces (see the sketch at the end of this issue).

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
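A minimal spark-shell sketch of the repro described above, assuming an Iceberg catalog named `demo` and a scratch path `/tmp/ts_millis_parquet` (both hypothetical names, not from the original report):

```scala
import java.sql.Timestamp
import spark.implicits._

// Force Spark to write Parquet timestamps with millisecond precision
// (Spark's default is TIMESTAMP_MICROS).
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")

// Write a plain (non-Iceberg) Parquet dataset with a timestamp column.
Seq(Timestamp.valueOf("2024-01-01 00:00:00"))
  .toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_millis_parquet")

// Create an empty Iceberg table with a matching schema.
spark.sql("CREATE TABLE demo.db.ts_table (ts timestamp) USING iceberg")

// Register the existing Parquet files with the Iceberg table via the
// add_files procedure, so the millisecond-precision file is scanned as-is.
spark.sql("""
  CALL demo.system.add_files(
    table => 'db.ts_table',
    source_table => '`parquet`.`/tmp/ts_millis_parquet`'
  )
""")

// Reading with the (default) vectorized reader should now fail with the
// ClassCastException quoted above.
spark.table("demo.db.ts_table").show()
```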
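And a sketch of verifying the workaround mentioned in the issue body, against the same hypothetical table:

```scala
// Disable Iceberg's vectorized Parquet reads for the current session
// and re-run the same query.
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")

// With the row-based read path, the same scan completes without the
// ClassCastException.
spark.table("demo.db.ts_table").show()
```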
