aokolnychyi commented on PR #4736: URL: https://github.com/apache/iceberg/pull/4736#issuecomment-1129530885
I think this is an interesting idea. Let me summarize how I understand it.

Right now, we compute a diff between the reachability sets before and after snapshot expiry. To build the reachability set before the expiry, we read the manifests of all snapshots, which seems suboptimal. The assumption is that it is sufficient to build the reachability set for the expired snapshots only and compare it to the reachability set after the snapshot expiry. Did I get that right?

I guess one way to implement this (and I believe this is what this PR tries to do) is to load the `FILES` metadata table for each expired snapshot and union the `DataFrame`s to produce the reachability set. One potential problem is that we would load the manifest list for every expired snapshot on the driver, which can become a bottleneck if we expire a lot of snapshots. I've seen such cases.

An alternative idea is to add some sort of `snapshot-ids` option to the `ALL_MANIFESTS` metadata table and read only the manifest lists for snapshots whose IDs are in that list. That way, we won't read manifest lists and manifests added after the last expired snapshot. That's already an optimization. However, we can go even further and remove manifests that are still live before opening them, which could give an even bigger performance boost. Suppose an expired snapshot references `manifest-1` but we know it is still live. In that case, we can skip opening it. Does that seem reasonable?

@szehon-ho @RussellSpitzer @flyrain @rdblue @kbendick

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
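To make the pruning idea from the comment concrete, here is a minimal sketch in plain Python. It is not Iceberg or Spark code: `SNAPSHOT_MANIFESTS`, `MANIFEST_FILES`, `manifests_of`, `files_of`, and `plan_expiry` are hypothetical names, and the dictionaries are made-up stand-ins for manifest lists and manifest contents. It only illustrates the set logic, assuming a manifest is "live" when a retained snapshot still lists it.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical stand-in for manifest lists: snapshot id -> manifests it lists.
SNAPSHOT_MANIFESTS: Dict[int, Set[str]] = {
    1: {"manifest-1", "manifest-2"},
    2: {"manifest-1", "manifest-3"},
    3: {"manifest-1", "manifest-4"},
}

# Hypothetical stand-in for manifest contents: manifest -> data files it references.
MANIFEST_FILES: Dict[str, Set[str]] = {
    "manifest-1": {"a.parquet", "b.parquet"},
    "manifest-2": {"c.parquet"},
    "manifest-3": {"c.parquet", "d.parquet"},
    "manifest-4": {"d.parquet", "e.parquet"},
}


def manifests_of(snapshot_ids: Iterable[int]) -> Set[str]:
    """Union the manifest lists of the given snapshots only."""
    result: Set[str] = set()
    for snapshot_id in snapshot_ids:
        result |= SNAPSHOT_MANIFESTS[snapshot_id]
    return result


def files_of(manifests: Iterable[str]) -> Set[str]:
    """'Open' each manifest and union the data files it references."""
    result: Set[str] = set()
    for manifest in manifests:
        result |= MANIFEST_FILES[manifest]
    return result


def plan_expiry(expired_ids: Set[int], retained_ids: Set[int]) -> Tuple[Set[str], Set[str]]:
    """Return (data files to delete, manifests to delete)."""
    live_manifests = manifests_of(retained_ids)
    # Prune before opening: a manifest that is still live cannot contribute
    # any file that is safe to delete, so it is never opened on this side.
    candidate_manifests = manifests_of(expired_ids) - live_manifests
    # Reachability set after expiry.
    live_files = files_of(live_manifests)
    # A file in a dropped manifest may still be referenced by a live one.
    files_to_delete = files_of(candidate_manifests) - live_files
    return files_to_delete, candidate_manifests


files, manifests = plan_expiry(expired_ids={1, 2}, retained_ids={3})
print(sorted(files), sorted(manifests))
```

In this toy data, expiring snapshots 1 and 2 never opens `manifest-1` on the expired side because snapshot 3 still references it. `d.parquet` is kept because the live `manifest-4` still points at it, so only `c.parquet` is deleted.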
