Fokko commented on issue #2604: URL: https://github.com/apache/iceberg-python/issues/2604#issuecomment-3434101307
Hey @Anton-Tarazi, thanks for raising this! I do think cleaning up the files of expired snapshots makes sense, but it can be pretty expensive, because the files they reference may still be used by other snapshots. Of course, we could clean up the manifest-lists, since they are unique per snapshot. We can do this on a best-effort basis: do the commit, and then delete the files.

> (Once https://github.com/apache/iceberg-python/pull/1958 is merged one could just call remove_orphan_files after the expire_snapshots and the result would be the same, but I think it's valuable to have expire_snapshots be consistent with the java version).

I think there is also a difference here. If we expire a snapshot, we can easily list all the files that are related to that snapshot using the metadata. `remove_orphan_files` will do a `list` operation on the object-store, which can be _pretty_ slow. I think if we want to clean up the data files, we could also collect a `Set` of the files that are in the expired snapshots, and compare that with the full metadata tree (we can use the metadata tables here).

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
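To make the `Set` comparison described in the comment concrete, here is a minimal sketch in plain Python. It does not call any PyIceberg API: the function name `files_safe_to_delete` and the `files_by_snapshot` mapping (snapshot ID to the file paths that snapshot references) are hypothetical stand-ins for whatever the metadata tables would provide.

```python
from typing import Dict, Iterable, Set


def files_safe_to_delete(
    files_by_snapshot: Dict[int, Set[str]],
    expired_snapshot_ids: Iterable[int],
) -> Set[str]:
    """Return files referenced only by expired snapshots.

    `files_by_snapshot` is a hypothetical mapping of snapshot ID to the
    file paths reachable from that snapshot's metadata.
    """
    expired = set(expired_snapshot_ids)
    candidates: Set[str] = set()
    still_referenced: Set[str] = set()

    for snapshot_id, paths in files_by_snapshot.items():
        if snapshot_id in expired:
            # Files reachable from a snapshot that is being expired.
            candidates |= paths
        else:
            # Files that a retained snapshot still needs.
            still_referenced |= paths

    # Only files that no retained snapshot references can be removed.
    return candidates - still_referenced


# Illustrative data only; the paths are made up.
files_by_snapshot = {
    1: {"s3://bucket/data/a.parquet", "s3://bucket/data/b.parquet"},
    2: {"s3://bucket/data/b.parquet", "s3://bucket/data/c.parquet"},
    3: {"s3://bucket/data/c.parquet", "s3://bucket/data/d.parquet"},
}

deletable = files_safe_to_delete(files_by_snapshot, expired_snapshot_ids=[1])
print(sorted(deletable))  # ['s3://bucket/data/a.parquet']
```

Expiring snapshot 1 leaves `b.parquet` in place because snapshot 2 still references it. The comparison needs only metadata, so no `list` call against the object store is involved.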
