mbutrovich commented on issue #3255:
URL: https://github.com/apache/datafusion-comet/issues/3255#issuecomment-4672339575

   I wasn't able to reproduce this with the current Iceberg code path. I
suspect the failing query performed the scan in Iceberg Java, and the resulting
arrays somehow made their way over to Comet without the dictionaries being
unpacked. I wrote a reproducer and couldn't get it to fail. Please reopen if
you re-encounter the issue. Thanks for the report!
   
   ```scala
     // https://github.com/apache/datafusion-comet/issues/3255
     // Reported as CometNativeException "Cannot perform binary operation on arrays of
     // different length" when applying to_date / datediff to a timestamp column read from
     // an Iceberg table. The suspected trigger is dictionary-encoded timestamp values, so
     // we insert many rows with few distinct timestamps to encourage dictionary encoding.
     test("to_date and datediff on Iceberg timestamp column - reproduces #3255") {
       assume(icebergAvailable, "Iceberg not available in classpath")
   
       withTempIcebergDir { warehouseDir =>
         withSQLConf(
           "spark.sql.catalog.test_cat" -> "org.apache.iceberg.spark.SparkCatalog",
           "spark.sql.catalog.test_cat.type" -> "hadoop",
           "spark.sql.catalog.test_cat.warehouse" -> warehouseDir.getAbsolutePath,
           CometConf.COMET_ENABLED.key -> "true",
           CometConf.COMET_EXEC_ENABLED.key -> "true",
           CometConf.COMET_ICEBERG_NATIVE_ENABLED.key -> "true") {
   
           // Spark TIMESTAMP maps to Iceberg timestamptz (TIMESTAMPTZ), matching the report.
           spark.sql("""
             CREATE TABLE test_cat.db.ts_date_test (
               id STRING,
               ts TIMESTAMP
             ) USING iceberg
           """)
   
           // 1000 rows over only 5 distinct timestamps -> highly compressible / dictionary-friendly.
           spark.sql("""
             INSERT INTO test_cat.db.ts_date_test
             SELECT
               CAST(id AS STRING) as id,
               CAST(CONCAT('2024-01-0', CAST(id % 5 + 1 AS STRING), ' 12:00:00') AS TIMESTAMP) as ts
             FROM range(1000)
           """)
   
           // Scalar (current_date / literal) vs dictionary-backed column: the reported crash.
           checkIcebergNativeScan("SELECT id, to_date(ts) FROM test_cat.db.ts_date_test ORDER BY id")
           checkIcebergNativeScan(
             "SELECT id, datediff(DATE '2025-01-01', ts) FROM test_cat.db.ts_date_test ORDER BY id")
           checkIcebergNativeScan(
             "SELECT COUNT(*) FROM test_cat.db.ts_date_test WHERE datediff(DATE '2025-01-01', ts) > 0")
   
           spark.sql("DROP TABLE test_cat.db.ts_date_test")
         }
       }
     }
   ```
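
For anyone following along, here is a minimal sketch of the failure mode suspected above. It is only an illustration: `DictColumn`, `subtract`, and `DictionarySketch` are hypothetical names, plain Scala collections stand in for Arrow arrays, and nothing here is Comet's or Iceberg's actual code. It assumes the reported error would arise if a kernel received the dictionary's values (5 entries) instead of the unpacked column (1000 rows).

```scala
object DictionarySketch {
  // Hypothetical, simplified model of a dictionary-encoded column.
  // `values` holds the distinct entries; `indices` holds one entry per row.
  final case class DictColumn(values: Vector[Long], indices: Vector[Int]) {
    // The logical length is the number of rows, not the number of distinct values.
    def length: Int = indices.length

    // "Unpacking" materializes one value per row by looking up each index.
    def unpack: Vector[Long] = indices.map(values)
  }

  // Same data shape as the reproducer: 1000 rows over 5 distinct timestamps
  // (microseconds since the epoch, one day apart).
  val distinctTs: Vector[Long] =
    Vector.tabulate(5)(d => 1704110400000000L + d * 86400000000L)
  val column: DictColumn = DictColumn(distinctTs, Vector.tabulate(1000)(_ % 5))

  // Element-wise binary operation that rejects operands of different length,
  // loosely modelled on the error message in the report.
  def subtract(left: Vector[Long], right: Vector[Long]): Vector[Long] = {
    require(
      left.length == right.length,
      "Cannot perform binary operation on arrays of different length")
    left.zip(right).map { case (l, r) => l - r }
  }

  // A scalar literal expanded to one value per row.
  val literal: Vector[Long] = Vector.fill(column.length)(1735689600000000L)
}
```

In this sketch, `subtract(literal, column.unpack)` succeeds because both operands have 1000 entries, while `subtract(literal, column.values)` fails the length check because the raw dictionary has only 5. Whether the original report hit something like this is still a guess.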


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.