ndrluis commented on PR #14027:
URL: https://github.com/apache/iceberg/pull/14027#issuecomment-3289655448

   Quick update on this issue - I'm going to focus on solving this problem on 
the Java side first. Once Iceberg Java has the correct behavior, I'll come back 
to PyIceberg and make the necessary adjustments. Here is the minimal test I'm 
running with PySpark (I'm more familiar with it than with the Java environment).
   
   **Tested with the following Iceberg Runtimes**:
   org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.9.0
   org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.10.0
   
   **Test Case**
    ```python
    import pytest
    from pyspark.sql import SparkSession

    from pyiceberg.catalog import Catalog
    from pyiceberg.exceptions import NoSuchTableError
    from pyiceberg.schema import Schema
    from pyiceberg.types import NestedField, UUIDType


    @pytest.mark.integration
    def test_uuid_write_read_with_pyspark(session_catalog: Catalog, spark: SparkSession) -> None:
        identifier = "default.test_uuid_write_and_read_with_pyspark"

        schema = Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False))

        try:
            session_catalog.drop_table(identifier=identifier)
        except NoSuchTableError:
            pass

        # _create_table is the helper used elsewhere in the integration test suite
        table = _create_table(session_catalog, identifier, {"format-version": "2"}, schema=schema)

        spark.sql(
            f"""
            INSERT INTO {identifier} VALUES
            ("22222222-2222-2222-2222-222222222222")
            """
        )
        df = spark.table(identifier)

        assert df.count() == 1

        result = df.where("uuid_col = '22222222-2222-2222-2222-222222222222'")
        assert result.count() == 1
    ```
   
    **Error**
    The test passes for `df.count()` but fails when applying the `WHERE` condition 
with the following error:
   
   ```
    25/09/14 12:45:49 ERROR BaseReader: Error reading file(s): s3://warehouse/default/test_uuid_write_and_read_with_pyspark/data/00000-0-c8b11c46-5ef7-426e-a1d5-de8aa720af6d-0-00001.parquet
    java.lang.ClassCastException: class java.util.UUID cannot be cast to class java.nio.ByteBuffer (java.util.UUID and java.nio.ByteBuffer are in module java.base of loader 'bootstrap')
            at java.base/java.nio.ByteBuffer.compareTo(ByteBuffer.java:267)
            at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:52)
            at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:47)
            at org.apache.iceberg.types.Comparators$NullSafeChainedComparator.compare(Comparators.java:253)
            at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:352)
            at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:79)
            at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162)
            at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
            at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
            at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eval(ParquetMetricsRowGroupFilter.java:103)
            at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter.shouldRead(ParquetMetricsRowGroupFilter.java:73)
            at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:108)
            at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
            at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
            at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:116)
            at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43)
            at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:134)
            at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:120)
            at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:158)
            [... rest of stack trace ...]
   ```
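   
   The stack trace shows the row-group filter's natural-order comparator handing a `java.util.UUID` predicate literal to `ByteBuffer.compareTo`. A minimal standalone sketch of that cast failure (hypothetical demo code, not Iceberg internals; the class and method names are mine) might look like:
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.Comparator;
   import java.util.UUID;

   // Hypothetical demo: mixing a ByteBuffer column bound (how Parquet metrics
   // carry UUID bounds) with a java.util.UUID predicate literal under a
   // natural-order comparator, as in the ParquetMetricsRowGroupFilter trace.
   public class UuidCastDemo {

       @SuppressWarnings({"unchecked", "rawtypes"})
       static boolean reproduces() {
           // Raw comparator: erased generics let the two incompatible types meet.
           Comparator cmp = Comparator.naturalOrder();
           Comparable bound = ByteBuffer.wrap(new byte[16]); // metrics bound: raw bytes
           Comparable literal = UUID.fromString("22222222-2222-2222-2222-222222222222");
           try {
               // NaturalOrderComparator calls bound.compareTo(literal);
               // ByteBuffer's compareTo casts its argument to ByteBuffer.
               cmp.compare(bound, literal);
               return false;
           } catch (ClassCastException e) {
               return true; // "class java.util.UUID cannot be cast to class java.nio.ByteBuffer"
           }
       }

       public static void main(String[] args) {
           System.out.println("ClassCastException reproduced: " + reproduces());
       }
   }
   ```
   
   So a fix on the Java side presumably needs to convert one of the two sides (literal or bound) before the metrics comparison, rather than relying on natural ordering.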


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
