[GitHub] [iceberg] aokolnychyi commented on pull request #4736: WIP: Improve performance of expire snapshot by not double-scanning non-expired manifests

GitBox Tue, 17 May 2022 20:35:23 -0700


aokolnychyi commented on PR #4736:
URL: https://github.com/apache/iceberg/pull/4736#issuecomment-1129530885


   I think this is an interesting idea. Let me summarize how I understand it.
   
   Right now, we compute a diff between the reachability sets before and after 
snapshot expiry. To build the reachability set before the expiry, we read the 
manifests of all snapshots, which seems suboptimal. The assumption is that it 
is sufficient to build the reachability set for the expired snapshots only and 
compare it to the reachability set after the snapshot expiry. Did I get that 
right?
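
   To make the assumption concrete, here is a minimal sketch in plain Python 
(not Iceberg or Spark code). The snapshot IDs, manifest names, file names, and 
the `reachable_files` helper are all made up for illustration, and the model 
tracks only data files, which is a simplification. It shows that diffing 
against the expired snapshots alone yields the same files to delete, since 
(E ∪ R) − R = E − R for any sets E and R.

```python
from typing import Dict, Iterable, List, Set

def reachable_files(
    snapshot_ids: Iterable[int],
    snapshot_manifests: Dict[int, List[str]],
    manifest_files: Dict[str, List[str]],
) -> Set[str]:
    """Collect every data file reachable from the given snapshots."""
    files: Set[str] = set()
    for snapshot_id in snapshot_ids:
        for manifest in snapshot_manifests[snapshot_id]:
            files.update(manifest_files[manifest])
    return files

# Hypothetical toy metadata: snapshot -> manifests, manifest -> data files.
snapshot_manifests = {1: ["m1", "m2"], 2: ["m2", "m3"], 3: ["m3", "m4"]}
manifest_files = {
    "m1": ["a.parquet"],
    "m2": ["b.parquet"],
    "m3": ["c.parquet"],
    "m4": ["d.parquet"],
}
expired, retained = [1, 2], [3]

# Files still reachable after the expiry.
after = reachable_files(retained, snapshot_manifests, manifest_files)

# Current approach: the "before" set covers all snapshots.
before_all = reachable_files(expired + retained, snapshot_manifests, manifest_files)
orphans_current = before_all - after

# Proposed approach: the "before" set covers expired snapshots only.
before_expired = reachable_files(expired, snapshot_manifests, manifest_files)
orphans_proposed = before_expired - after
```

   In this toy model both approaches mark `a.parquet` and `b.parquet` for 
deletion, but the second one never walks the retained snapshot to build the 
"before" set.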
   
   I guess one way to implement this (and I believe this is what this PR tries 
to do) is to load the `FILES` metadata table for each expired snapshot and 
union the resulting `DataFrame`s to produce the reachability set. One potential 
problem is that we would load the manifest list for every expired snapshot on 
the driver, which can become a bottleneck if we expire a lot of snapshots. I've 
seen such cases.
   
   An alternative idea is to add some sort of `snapshot-ids` option to the 
`ALL_MANIFESTS` metadata table and read only the manifest lists of snapshots 
whose IDs are in that list. That way, we won't read manifest lists and 
manifests added after the last expired snapshot. That's already an 
optimization. However, we can go even further and filter out manifests that are 
still live before opening them, which could give an even bigger performance 
boost. Suppose an expired snapshot references `manifest-1` but we know it is 
still live. In that case, we can skip opening it.
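
   The manifest-level pruning could be sketched roughly like this, again in 
plain Python rather than against any real Iceberg API. The `manifests_to_open` 
helper and the toy metadata are hypothetical, and passing the expired IDs 
explicitly only stands in for the proposed `snapshot-ids` option.

```python
from typing import Dict, Iterable, List

def manifests_to_open(
    expired_ids: Iterable[int],
    retained_ids: Iterable[int],
    snapshot_manifests: Dict[int, List[str]],
) -> List[str]:
    """Return manifests referenced only by expired snapshots.

    Only manifest lists are consulted here; no manifest is opened.
    """
    # Manifests still referenced by at least one retained snapshot.
    live = {m for sid in retained_ids for m in snapshot_manifests[sid]}
    # Manifests referenced by the snapshots being expired.
    candidates = {m for sid in expired_ids for m in snapshot_manifests[sid]}
    # Live manifests are dropped before anything is opened.
    return sorted(candidates - live)

# Hypothetical toy metadata: snapshots 1 and 2 are expired, 3 is retained.
toy_snapshot_manifests = {
    1: ["manifest-0", "manifest-1"],
    2: ["manifest-1", "manifest-2"],
    3: ["manifest-1", "manifest-3"],
}
pruned = manifests_to_open([1, 2], [3], toy_snapshot_manifests)
```

   Here `manifest-1` is skipped because the retained snapshot still references 
it, so only `manifest-0` and `manifest-2` would be opened. The files they list 
would still have to be compared against the reachability set after the expiry.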
   
   Does that seem reasonable?
   
   @szehon-ho @RussellSpitzer @flyrain @rdblue @kbendick 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

