szehon-ho commented on PR #4736: URL: https://github.com/apache/iceberg/pull/4736#issuecomment-1176854697
Hi @aokolnychyi @rdblue, an update on this discussion (see the main changes in #3457). I implemented and experimented with both proposed optimizations for computing the delete-candidate files: skipping valid `reference_snapshot_id`s, and skipping valid manifests.

The first skip, filtering the `all_manifests` table with `reference_snapshot_id in set(expired snapshot ids)`, makes a clear positive difference. The result matches what I saw initially: a 30-40% reduction in Spark job time when expiring 1 snapshot out of many. Since each manifest in the `all_manifests` table is now tagged with a snapshot id, this first filter already excludes any manifest reachable from a current snapshot.

Implementing the second skip (valid manifests) actually makes performance worse, as it adds one more Spark read of the `all_manifests` table and a Spark join for little gain. Unless there is a scenario I am not thinking of.
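Conceptually, the first skip amounts to a per-row set-membership check on the `all_manifests` table. A minimal Python sketch of that idea (field and function names here are illustrative, not Iceberg's actual schema or API):

```python
# Hypothetical sketch of the skip described above: keep only manifests whose
# reference_snapshot_id belongs to an expired snapshot. In the real Spark job
# this would be a filter/broadcast over the all_manifests metadata table.

def delete_candidate_manifests(all_manifests, expired_snapshot_ids):
    """Return manifests referenced by an expired snapshot (delete candidates)."""
    expired = set(expired_snapshot_ids)  # O(1) membership checks per row
    return [m for m in all_manifests if m["reference_snapshot_id"] in expired]

# Example: expiring snapshot 2 out of snapshots {1, 2, 3}
manifests = [
    {"path": "m1.avro", "reference_snapshot_id": 1},
    {"path": "m2.avro", "reference_snapshot_id": 2},
    {"path": "m3.avro", "reference_snapshot_id": 3},
]
candidates = delete_candidate_manifests(manifests, [2])
```

Because every row carries its `reference_snapshot_id`, this single pass already excludes manifests still referenced by live snapshots, which is why a second join against valid manifests adds cost without narrowing the result further.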
