aokolnychyi commented on pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#issuecomment-856093977
I agree that figuring out whether a delete file is still needed is complex and is not really specific to compaction. Ideally, the algorithm should be generic and efficient enough that we can apply it beyond compaction use cases. I think partition-level summaries may help us there (i.e. knowing the minimum sequence number of data files per partition).

There is another question that we can probably consider sooner (but after this PR). We could pick a file for rewrite even if its size is optimal when it requires us to apply a lot of delete files on scan. It probably makes sense to include such data files in rewrites because the new file will have a higher sequence number, so the deletes will no longer apply. Again, that's something we can add in the future.
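To make the two ideas concrete, here is a rough sketch in Python. It is not Iceberg code: `DataFile`, `DeleteFile`, `select_for_rewrite`, `obsolete_deletes`, and the threshold parameter are hypothetical names for illustration. It assumes the sequence-number rules from the table spec (a position delete applies to data files with a sequence number less than or equal to its own, an equality delete only to strictly older data files), and it ignores global equality deletes and the fact that position deletes target specific file paths, so the summary check is only an approximation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataFile:  # hypothetical model, not the Iceberg API
    path: str
    partition: str
    sequence_number: int
    size_bytes: int


@dataclass(frozen=True)
class DeleteFile:  # hypothetical model, not the Iceberg API
    path: str
    partition: str
    sequence_number: int
    is_equality: bool = False


def applies(delete: DeleteFile, data: DataFile) -> bool:
    """Sequence-number check only; real planning also uses stats and file paths."""
    if delete.partition != data.partition:
        return False
    if delete.is_equality:
        return data.sequence_number < delete.sequence_number
    return data.sequence_number <= delete.sequence_number


def select_for_rewrite(data_files: List[DataFile],
                       delete_files: List[DeleteFile],
                       delete_file_threshold: int) -> List[DataFile]:
    """Pick data files that need many deletes on scan, regardless of their size."""
    selected = []
    for data in data_files:
        count = sum(1 for d in delete_files if applies(d, data))
        if count >= delete_file_threshold:
            selected.append(data)
    return selected


def min_data_sequence_by_partition(data_files: List[DataFile]) -> Dict[str, int]:
    """Partition-level summary: minimum data sequence number per partition."""
    summary: Dict[str, int] = {}
    for f in data_files:
        current = summary.get(f.partition)
        if current is None or f.sequence_number < current:
            summary[f.partition] = f.sequence_number
    return summary


def obsolete_deletes(data_files: List[DataFile],
                     delete_files: List[DeleteFile]) -> List[DeleteFile]:
    """Delete files that cannot apply to any remaining data file in their partition."""
    summary = min_data_sequence_by_partition(data_files)
    result = []
    for d in delete_files:
        min_seq = summary.get(d.partition)
        if min_seq is None:
            result.append(d)  # no data files left in the partition
        elif d.is_equality and min_seq >= d.sequence_number:
            result.append(d)
        elif not d.is_equality and min_seq > d.sequence_number:
            result.append(d)
    return result


data = [DataFile("a.parquet", "p=1", 1, 512), DataFile("b.parquet", "p=1", 5, 512)]
deletes = [DeleteFile("d1", "p=1", 2), DeleteFile("d2", "p=1", 3),
           DeleteFile("d3", "p=1", 4, is_equality=True)]

picked = select_for_rewrite(data, deletes, delete_file_threshold=2)

# Simulate the rewrite: the replacement file gets a new, higher sequence number.
rewritten = [DataFile("a-rewritten.parquet", "p=1", 6, 512), data[1]]
droppable = obsolete_deletes(rewritten, deletes)
```

In this toy setup only `a.parquet` is picked, and once it is rewritten with sequence number 6 the partition minimum becomes 5, so all three delete files fall below it and could be dropped.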
