gngj opened a new issue, #8565:
URL: https://github.com/apache/iceberg/issues/8565

   ### Apache Iceberg version
   
   1.3.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ## Description
   Expiring a snapshot by id breaks the ability to query older snapshots using "TIMESTAMP AS OF".
   
   ### Expected Behavior
   When expiring a specific snapshot, I expect "TIMESTAMP AS OF" to keep working for the time frames of all remaining snapshots, just as "VERSION AS OF" keeps working for all snapshots except the expired one.
   
   ### Actual Behavior
   Currently, when I expire a snapshot by id:
   - Using "TIMESTAMP AS OF" only works for snapshots newer than the expired one; querying older time frames fails with "java.lang.IllegalArgumentException: Cannot find a snapshot older than <timestamp>".
   - Using "VERSION AS OF" works for all remaining snapshots.
   
   ### Steps to Reproduce
   1. `docker run --rm -it shmuelj613/spark-iceberg:3.4.1-1.3.1`
   2. `iceberg-spark-init`
   3. Run the following to fill the table with 5 snapshots
   ```
   spark.sql("CREATE TABLE iceberg.db.my_table (text string) USING iceberg;")
   spark.sql("INSERT INTO iceberg.db.my_table VALUES ('1');")
   spark.sql("INSERT INTO iceberg.db.my_table VALUES ('2');")
   var first_ver = spark.sql("select current_timestamp();").first().getTimestamp(0).toString()
   spark.sql("INSERT INTO iceberg.db.my_table VALUES ('3');")
   spark.sql("INSERT INTO iceberg.db.my_table VALUES ('4');")
   var sec_ver = spark.sql("select current_timestamp();").first().getTimestamp(0).toString()
   spark.sql("INSERT INTO iceberg.db.my_table VALUES ('5');")
   ```
   4. Run the following to expire the 3rd snapshot
   ```
   var snapshot_to_exp = spark.sql("SELECT snapshot_id FROM iceberg.db.my_table.snapshots order by committed_at limit 1 offset 2;").first.getLong(0).toString()
   spark.sql("CALL iceberg.system.expire_snapshots(table => 'db.my_table', snapshot_ids => ARRAY(" + snapshot_to_exp + "))")
   ```
   6. Use "TIMESTAMP AS OF" with a timestamp between Snapshot 4 and 5. (works)
   ```
   spark.sql("select * from iceberg.db.my_table TIMESTAMP AS OF'" + sec_ver + 
"';").show()
   ```
   7. Use "TIMESTAMP AS OF" with a timestamp between Snapshot 1 and 2. (errors)
   ```
   spark.sql("select * from iceberg.db.my_table TIMESTAMP AS OF'" + first_ver + 
"';").show()
   ```
   8. Use "VERSION AS OF" for various snapshot versions. (works)
   ```
   var snap_id = spark.sql("SELECT snapshot_id FROM iceberg.db.my_table.snapshots order by committed_at limit 1 offset 1;").first.getLong(0).toString()
   spark.sql("select * from iceberg.db.my_table VERSION AS OF '" + snap_id + "';").show()
   ```
   8. Observe the behavior described above; a short verification sketch of the metadata tables follows these steps.
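   
   For that verification, the two Iceberg metadata tables of the reproduction table can be compared directly. This is only a convenience sketch and uses the same catalog and table names as the steps above:
   ```
   // All snapshots Iceberg still tracks for the table; in my run this still
   // listed every snapshot after the expire call.
   spark.sql("SELECT snapshot_id, committed_at FROM iceberg.db.my_table.snapshots ORDER BY committed_at").show(false)

   // The snapshot log / history; in my run this only had the last 2 entries
   // after expiring the 3rd snapshot, which is what "TIMESTAMP AS OF" seems to hit.
   spark.sql("SELECT made_current_at, snapshot_id, is_current_ancestor FROM iceberg.db.my_table.history ORDER BY made_current_at").show(false)
   ```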
   
   The Docker image is based on https://www.dremio.com/blog/introduction-to-apache-iceberg-using-spark/, just with updated versions.
   
   ### Additional info
   I observed that the snapshots metadata table and the 'snapshots' field in the metadata file still contain all snapshots, while the history metadata table and the 'snapshot-log' field in the metadata file only contain the last 2 versions.
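   
   This would explain the error if timestamp-based time travel resolves the snapshot by scanning only the snapshot-log/history rather than the full snapshot list. That is an assumption on my part, not something I verified in the Iceberg code; the Scala sketch below only illustrates the assumed lookup, and LogEntry / snapshotIdAsOfTime are made-up names, not Iceberg APIs.
   ```
   // Illustrative sketch of an assumed "latest log entry at or before the
   // requested timestamp" lookup. LogEntry stands in for a snapshot-log entry.
   case class LogEntry(timestampMillis: Long, snapshotId: Long)

   def snapshotIdAsOfTime(snapshotLog: Seq[LogEntry], asOfMillis: Long): Long = {
     val candidates = snapshotLog.filter(_.timestampMillis <= asOfMillis)
     require(candidates.nonEmpty, s"Cannot find a snapshot older than $asOfMillis")
     candidates.maxBy(_.timestampMillis).snapshotId
   }

   // If expiring the 3rd snapshot trims the log down to the last 2 entries,
   // any timestamp before those entries has no candidate and the lookup fails,
   // even though the older snapshots are still listed in the snapshots table.
   val trimmedLog = Seq(LogEntry(4000L, 104L), LogEntry(5000L, 105L))
   snapshotIdAsOfTime(trimmedLog, 4500L)     // resolves snapshot 105's predecessor-or-self at that time (104)
   // snapshotIdAsOfTime(trimmedLog, 1500L)  // throws: requirement failed: Cannot find a snapshot older than 1500
   ```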


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

