mbutrovich commented on issue #3255:
URL:
https://github.com/apache/datafusion-comet/issues/3255#issuecomment-4672339575
I wasn't able to reproduce this with the current Iceberg code path. I
suspect the original query performed the scan in Iceberg Java, and the
dictionary-encoded batches then somehow made their way over to Comet without
the dictionaries being unpacked. I wrote a reproducer (below) and couldn't
get it to fail. Please reopen if you re-encounter the issue. Thanks for the
report!
```scala
// https://github.com/apache/datafusion-comet/issues/3255
// Reported as CometNativeException "Cannot perform binary operation on arrays of
// different length" when applying to_date / datediff to a timestamp column read from
// an Iceberg table. The suspected trigger is dictionary-encoded timestamp values, so
// we insert many rows with few distinct timestamps to encourage dictionary encoding.
test("to_date and datediff on Iceberg timestamp column - reproduces #3255") {
  assume(icebergAvailable, "Iceberg not available in classpath")
  withTempIcebergDir { warehouseDir =>
    withSQLConf(
      "spark.sql.catalog.test_cat" -> "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.test_cat.type" -> "hadoop",
      "spark.sql.catalog.test_cat.warehouse" -> warehouseDir.getAbsolutePath,
      CometConf.COMET_ENABLED.key -> "true",
      CometConf.COMET_EXEC_ENABLED.key -> "true",
      CometConf.COMET_ICEBERG_NATIVE_ENABLED.key -> "true") {
      // Spark TIMESTAMP maps to Iceberg timestamptz (TIMESTAMPTZ), matching the report.
      spark.sql("""
        CREATE TABLE test_cat.db.ts_date_test (
          id STRING,
          ts TIMESTAMP
        ) USING iceberg
      """)
      // 1000 rows over only 5 distinct timestamps -> highly compressible / dictionary-friendly.
      spark.sql("""
        INSERT INTO test_cat.db.ts_date_test
        SELECT
          CAST(id AS STRING) as id,
          CAST(CONCAT('2024-01-0', CAST(id % 5 + 1 AS STRING), ' 12:00:00') AS TIMESTAMP) as ts
        FROM range(1000)
      """)
      // Scalar (current_date / literal) vs dictionary-backed column: the reported crash.
      checkIcebergNativeScan("SELECT id, to_date(ts) FROM test_cat.db.ts_date_test ORDER BY id")
      checkIcebergNativeScan(
        "SELECT id, datediff(DATE '2025-01-01', ts) FROM test_cat.db.ts_date_test ORDER BY id")
      checkIcebergNativeScan(
        "SELECT COUNT(*) FROM test_cat.db.ts_date_test WHERE datediff(DATE '2025-01-01', ts) > 0")
      spark.sql("DROP TABLE test_cat.db.ts_date_test")
    }
  }
}
```
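For anyone wondering why un-unpacked dictionaries could surface as a length
mismatch, here is a toy model of the suspected mechanism. It is only a sketch
of the hypothesis above, not Comet's or Arrow's actual code: `DictColumn`,
`subtract`, and `DictionaryLengthSketch` are hypothetical names, and the values
are made up. A dictionary column stores a few distinct values plus one index
per row, so a kernel that is mistakenly handed the values instead of the
unpacked rows sees 5 elements where the other operand has 1000.

```scala
// Illustrative model only: these are NOT Comet's or Arrow's real data structures.
final case class DictColumn(values: Vector[Long], indices: Vector[Int]) {
  // The logical row count is the number of indices, not the number of dictionary values.
  def length: Int = indices.length

  // "Unpacking" materializes one value per row by looking up each index.
  def unpack: Vector[Long] = indices.map(values)
}

object DictionaryLengthSketch {
  // Element-wise subtraction that requires equal lengths, as a binary kernel typically does.
  def subtract(left: Vector[Long], right: Vector[Long]): Either[String, Vector[Long]] =
    if (left.length != right.length)
      Left(
        "Cannot perform binary operation on arrays of different length: " +
          s"${left.length} vs ${right.length}")
    else
      Right(left.zip(right).map { case (l, r) => l - r })

  // 1000 rows over 5 distinct values, mirroring the data shape in the reproducer.
  val column: DictColumn = DictColumn(
    values = Vector(1L, 2L, 3L, 4L, 5L),
    indices = Vector.tabulate(1000)(i => i % 5))

  // A literal operand broadcast to one value per row.
  val literal: Vector[Long] = Vector.fill(column.length)(10L)

  // Correct path: unpack first, so both operands have 1000 elements.
  val unpacked: Either[String, Vector[Long]] = subtract(literal, column.unpack)

  // Suspected faulty path: the raw dictionary values (5 elements) reach the kernel.
  val notUnpacked: Either[String, Vector[Long]] = subtract(literal, column.values)
}
```

In this model `unpacked` succeeds and `notUnpacked` is a `Left` carrying a
length-mismatch message, which is consistent with the reported error but does
not confirm that this is what happened in the original report.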