mdomaradzki-cosmose opened a new issue, #15103:
URL: https://github.com/apache/iceberg/issues/15103
### Query engine
Spark 4.0.1
Iceberg 1.10.0
### Question
I have an Iceberg table defined as below:

```python
spark.sql("""
    CREATE OR REPLACE TABLE my_db.my_table (
        serverTime TIMESTAMP,
        id LONG,
        ...
    )
    USING iceberg
    PARTITIONED BY (days(serverTime), bucket(50, id))
""")
```
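For context, here is a minimal sketch (plain Python, not Iceberg's actual implementation) of how a predicate on `serverTime` projects onto the `days(serverTime)` partition transform. The function names are illustrative only. Notably, under this sketch both `>=` and `>` at a midnight boundary keep the same set of day partitions, so any runtime difference between the two queries would have to come from the file-level statistics step rather than partition pruning.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts: datetime) -> int:
    """Iceberg-style days() transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

def partitions_to_scan(partition_days, cutoff):
    """Inclusive projection of either serverTime >= cutoff or
    serverTime > cutoff onto days(serverTime) partitions: any day on
    or after day(cutoff) may still hold matching rows (even for a
    strict >, since later rows within the boundary day match), so the
    kept partition set is identical for both predicates."""
    boundary = days_transform(cutoff)
    return [d for d in partition_days if d >= boundary]

cutoff = datetime(2026, 1, 15, tzinfo=timezone.utc)
days = [days_transform(datetime(2026, 1, d, tzinfo=timezone.utc))
        for d in range(10, 20)]
print(partitions_to_scan(days, cutoff))  # day ordinals for Jan 15..19 only
```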
I run the queries below:

```python
spark.sql("SELECT count(*) FROM my_db.my_table WHERE serverTime >= '2026-01-15 00:00:00'")
spark.sql("SELECT count(*) FROM my_db.my_table WHERE serverTime > '2026-01-15 00:00:00'")
```
The first query finishes almost instantly because whole partitions can be filtered out. The second query, however, takes a long time (around 15 minutes). I checked the per-Parquet-file statistics, and for the partition `date('2026-01-15')` the `lower_bounds` value for `serverTime` is always greater than `'2026-01-15 00:00:00'`. Shouldn't Spark/Iceberg use these file-level statistics instead of reading all the files?
I'm using Spark 4.0.1 and Iceberg 1.10.0.
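To illustrate the kind of skipping the metadata should allow, below is a minimal sketch of min/max-based file pruning for a strict `>` predicate. This is plain Python, not Iceberg's code; `FileStats` and `can_skip_for_gt` are illustrative names. The idea is that a file whose upper bound is at or below the cutoff cannot contain any matching row.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FileStats:
    """Per-file column bounds, analogous to the lower_bounds /
    upper_bounds Iceberg records per data file for serverTime."""
    path: str
    lower: datetime
    upper: datetime

def can_skip_for_gt(stats: FileStats, value: datetime) -> bool:
    # For serverTime > value, a file whose maximum serverTime is
    # <= value cannot contain any matching row and can be skipped.
    return stats.upper <= value

cutoff = datetime(2026, 1, 15)
files = [
    FileStats("f1.parquet", datetime(2026, 1, 14, 1), datetime(2026, 1, 14, 23)),
    FileStats("f2.parquet", datetime(2026, 1, 15, 0, 5), datetime(2026, 1, 15, 12)),
]
kept = [f.path for f in files if not can_skip_for_gt(f, cutoff)]
print(kept)  # ['f2.parquet']
```

(If it helps reproduce, I inspected the bounds via the table's `files` metadata table, e.g. `SELECT file_path, lower_bounds, upper_bounds FROM my_db.my_table.files` — though the bounds there are serialized per column id.)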
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]