ajantha-bhat commented on PR #4674: URL: https://github.com/apache/iceberg/pull/4674#issuecomment-1118111005
@szehon-ho :

> @RussellSpitzer pointed me to this. I had a PR that is orthogonal to this, to avoid duplicate computation of all_reachable_files (https://github.com/apache/iceberg/pull/3457). To me that was the bigger time consumer (exploring all reachable files), though maybe I need to redo that PR. Wasn't sure how much of a bottleneck getting all_manifests was.

Yeah, scanning the all_manifests table twice was the major problem for me.

> Anyway, agree with @RussellSpitzer that maybe cache is a better option than persist? It'd be great to see some numbers for tables with huge snapshots for these two options vs today, if possible. I think, if we go with this approach, it should probably be 1) configurable, 2) able to be GC'ed sooner rather than later.

Sure, I will make caching a configurable option and gather performance numbers locally on a table with a large number of snapshots. I will work on this over the weekend.
