rdblue commented on pull request #2182: URL: https://github.com/apache/iceberg/pull/2182#issuecomment-772786048
Thanks, @vvellanki. I think we will need to take a closer look at the plan for maintaining these files. I think we should track the last file and the snapshot it is based on, so we can apply diffs to it and update asynchronously.

For the use case, I'm curious why you aren't using the actual files? In Spark, we push filters down to Iceberg before producing stats. Then stats are based on the actual files that will be scanned, which can be significantly smaller than just partition-level stats. That allows a lot more joins to be converted to broadcast joins. Is it possible to push filters earlier in your job planning?
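To illustrate the difference between partition-level stats and stats taken from the files that remain after filter pushdown, here is a minimal sketch. It does not use the Iceberg or Spark APIs: `DataFile`, `partition_level_size`, `file_level_size`, and `can_broadcast` are hypothetical names, the file sizes and id ranges are made up, and the 10 MB threshold is only an assumed value for the example.

```python
from dataclasses import dataclass

# Assumed broadcast threshold for this illustration only.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for per-file metadata; not the Iceberg API."""
    partition: str
    size_bytes: int
    min_id: int  # lower bound of the `id` column in this file
    max_id: int  # upper bound of the `id` column in this file


def partition_level_size(files, partition):
    """Size estimate that counts every file in the partition."""
    return sum(f.size_bytes for f in files if f.partition == partition)


def file_level_size(files, partition, id_value):
    """Size estimate that counts only files whose id range may contain id_value."""
    return sum(
        f.size_bytes
        for f in files
        if f.partition == partition and f.min_id <= id_value <= f.max_id
    )


def can_broadcast(size_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """True when the estimated size fits under the broadcast threshold."""
    return size_bytes <= threshold


MB = 1024 * 1024
files = [
    DataFile("2021-02-01", 8 * MB, 0, 999),
    DataFile("2021-02-01", 8 * MB, 1000, 1999),
    DataFile("2021-02-01", 8 * MB, 2000, 2999),
    DataFile("2021-02-02", 8 * MB, 0, 999),
]

# Filter: partition = '2021-02-01' AND id = 1500
coarse = partition_level_size(files, "2021-02-01")
fine = file_level_size(files, "2021-02-01", 1500)
```

With these made-up numbers, the partition-level estimate counts all three files in the partition and exceeds the assumed threshold, while the file-level estimate counts only the one file whose id range can match and fits under it.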
