Anton-Tarazi opened a new issue, #2604:
URL: https://github.com/apache/iceberg-python/issues/2604

   ### Feature Request / Improvement
   
   Running an expire-snapshots operation currently only rewrites the metadata file 
without the expired snapshots (and their refs/statistics). It does not delete the 
data files that are referenced only by the expired snapshots. This can be observed 
by deleting all the data in a table and then calling `expire_snapshots` - the data 
files still exist on disk. Trino and Spark both clean up data files once every 
snapshot referencing them has been expired. 
   
   From the spec:
   ```
   When a file is replaced or deleted from the dataset, its manifest entry fields 
   store the snapshot ID in which the file was deleted and status 2 (deleted). 
   The file may be deleted from the file system when the snapshot in which it was 
   deleted is garbage collected, assuming that older snapshots have also been 
   garbage collected [1].
   ...
   [1] Technically, data files can be deleted when the last snapshot that contains 
   the file as "live" data is garbage collected. But this is harder to detect and 
   requires finding the diff of multiple snapshots. It is easier to track what 
   files are deleted in a snapshot and delete them when that snapshot expires. 
   It is not recommended to add a deleted file back to a table. Adding a deleted 
   file can lead to edge cases where incremental deletes can break table snapshots.
   ```
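
   To make the spec's rule concrete, here is a minimal sketch of the cleanup decision, using an assumed in-memory model of manifest entries (real manifests are Avro files tracked through manifest lists; the class and function names here are only illustrative, not pyiceberg API):
   
   ```python
   # Illustrative sketch only: models the spec's "delete files whose DELETED
   # entry belongs to an expired snapshot" rule over plain Python objects.
   from dataclasses import dataclass
   
   # Manifest entry statuses from the Iceberg spec.
   EXISTING, ADDED, DELETED = 0, 1, 2
   
   @dataclass(frozen=True)
   class ManifestEntry:
       file_path: str
       status: int
       snapshot_id: int  # snapshot in which the file was added/deleted
   
   def files_safe_to_remove(entries, expired_snapshot_ids, live_snapshot_ids):
       """Return data file paths that may be deleted from the file system.
   
       Per the spec, a file with status 2 (deleted) may be removed when the
       snapshot that deleted it is garbage collected, assuming older snapshots
       have also been garbage collected. Here that caveat is approximated by
       requiring that no still-live snapshot lists the file as live data.
       """
       still_live = {
           e.file_path
           for e in entries
           if e.status in (EXISTING, ADDED) and e.snapshot_id in live_snapshot_ids
       }
       return {
           e.file_path
           for e in entries
           if e.status == DELETED
           and e.snapshot_id in expired_snapshot_ids
           and e.file_path not in still_live
       }
   ```
   
   For example, a file added in snapshot 1 and deleted in snapshot 2 becomes safe to remove once both snapshots are expired, but not while snapshot 1 is still live.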
   
   Happy to work on this if others agree that this should be added :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

