mdomaradzki-cosmose opened a new issue, #15103:
URL: https://github.com/apache/iceberg/issues/15103
### Query engine
Spark 4.0.1
Iceberg 1.10.0
### Question
I have an Iceberg table defined as below:

```python
spark.sql("""
    CREATE OR REPLACE TABLE my_db.my_table (
        serverTime TIMESTAMP,
        id LONG,
        ...
    )
    USING iceberg
    PARTITIONED BY (days(serverTime), bucket(50, id))
""")
```
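For context, here is a minimal sketch (plain Python, not Iceberg's actual implementation) of how a predicate on `serverTime` projects onto the `days(serverTime)` partition transform. The function names are illustrative only. Notably, under this sketch both `>=` and `>` at a midnight boundary keep the same set of day partitions, so any runtime difference between the two queries would have to come from the file-level statistics step rather than partition pruning.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days_transform(ts: datetime) -> int:
    """Iceberg-style days() transform: whole days since the Unix epoch."""
    return (ts - EPOCH).days

def partitions_to_scan(partition_days, cutoff):
    """Inclusive projection of either serverTime >= cutoff or
    serverTime > cutoff onto days(serverTime) partitions: any day on
    or after day(cutoff) may still hold matching rows (even for a
    strict >, since later rows within the boundary day match), so the
    kept partition set is identical for both predicates."""
    boundary = days_transform(cutoff)
    return [d for d in partition_days if d >= boundary]

cutoff = datetime(2026, 1, 15, tzinfo=timezone.utc)
days = [days_transform(datetime(2026, 1, d, tzinfo=timezone.utc))
        for d in range(10, 20)]
print(partitions_to_scan(days, cutoff))  # day ordinals for Jan 15..19 only
```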
I run the queries below:

```python
spark.sql("SELECT count(*) FROM my_db.my_table WHERE serverTime >= '2026-01-15 00:00:00'")
spark.sql("SELECT count(*) FROM my_db.my_table WHERE serverTime > '2026-01-15 00:00:00'")
```
The first query finishes almost instantly because whole partitions can be filtered out. The second query, however, takes a long time (around 15 minutes). I checked the per-Parquet-file statistics, and for the partition `date('2026-01-15')` the `lower_bounds` value for `serverTime` is always greater than `'2026-01-15 00:00:00'`. Shouldn't Spark/Iceberg use these file-level statistics instead of reading all the files?
I'm using Spark 4.0.1 and Iceberg 1.10.0.
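To illustrate the kind of skipping the metadata should allow, below is a minimal sketch of min/max-based file pruning for a strict `>` predicate. This is plain Python, not Iceberg's code; `FileStats` and `can_skip_for_gt` are illustrative names. The idea is that a file whose upper bound is at or below the cutoff cannot contain any matching row.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FileStats:
    """Per-file column bounds, analogous to the lower_bounds /
    upper_bounds Iceberg records per data file for serverTime."""
    path: str
    lower: datetime
    upper: datetime

def can_skip_for_gt(stats: FileStats, value: datetime) -> bool:
    # For serverTime > value, a file whose maximum serverTime is
    # <= value cannot contain any matching row and can be skipped.
    return stats.upper <= value

cutoff = datetime(2026, 1, 15)
files = [
    FileStats("f1.parquet", datetime(2026, 1, 14, 1), datetime(2026, 1, 14, 23)),
    FileStats("f2.parquet", datetime(2026, 1, 15, 0, 5), datetime(2026, 1, 15, 12)),
]
kept = [f.path for f in files if not can_skip_for_gt(f, cutoff)]
print(kept)  # ['f2.parquet']
```

(If it helps reproduce, I inspected the bounds via the table's `files` metadata table, e.g. `SELECT file_path, lower_bounds, upper_bounds FROM my_db.my_table.files` — though the bounds there are serialized per column id.)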
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]