szehon-ho commented on PR #4736: URL: https://github.com/apache/iceberg/pull/4736#issuecomment-1176854697
Hi @aokolnychyi @rdblue, an update on this discussion (see the main changes in #3457). I implemented and experimented with both proposed optimizations for computing the delete-candidate files: skipping valid `reference_snapshot_id`s, and skipping valid manifests.

The first skip, filtering the `all_manifests` table with `reference_snapshot_id in set(expired snapshot ids)`, makes a clear positive difference. The result matches what I saw initially: a 30-40% reduction in Spark job time when expiring 1 snapshot out of many. Since each manifest in the `all_manifests` table is now tagged with a snapshot id, this first filter already excludes any manifest reachable from a current snapshot.

Implementing the second skip (valid manifests) actually makes performance worse, as it adds one more Spark read of the `all_manifests` table and a Spark join for little gain. Unless there is a scenario I am not thinking of.
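Conceptually, the first skip amounts to a per-row set-membership check on the `all_manifests` table. A minimal Python sketch of that idea (field and function names here are illustrative, not Iceberg's actual schema or API):

```python
# Hypothetical sketch of the skip described above: keep only manifests whose
# reference_snapshot_id belongs to an expired snapshot. In the real Spark job
# this would be a filter/broadcast over the all_manifests metadata table.

def delete_candidate_manifests(all_manifests, expired_snapshot_ids):
    """Return manifests referenced by an expired snapshot (delete candidates)."""
    expired = set(expired_snapshot_ids)  # O(1) membership checks per row
    return [m for m in all_manifests if m["reference_snapshot_id"] in expired]

# Example: expiring snapshot 2 out of snapshots {1, 2, 3}
manifests = [
    {"path": "m1.avro", "reference_snapshot_id": 1},
    {"path": "m2.avro", "reference_snapshot_id": 2},
    {"path": "m3.avro", "reference_snapshot_id": 3},
]
candidates = delete_candidate_manifests(manifests, [2])
```

Because every row carries its `reference_snapshot_id`, this single pass already excludes manifests still referenced by live snapshots, which is why a second join against valid manifests adds cost without narrowing the result further.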
