aokolnychyi commented on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-856093977


   I agree that figuring out whether a delete file is still needed is complex 
and that the problem is not really specific to compaction. Ideally, the 
algorithm should be generic and efficient enough that we can apply it beyond 
compaction use cases. I think having partition-level summaries may help us 
there (i.e. knowing the minimum sequence number of data files per partition).
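
   To make the partition-level summary idea concrete, here is a minimal sketch 
in Python. It is not Iceberg code: `DeleteFile`, 
`min_data_sequence_by_partition`, and `is_delete_still_needed` are hypothetical 
names, and partitions are modeled as plain strings. It assumes the usual 
sequence number rule, where a position delete applies to data files with a 
sequence number less than or equal to its own and an equality delete applies 
only to data files with a strictly smaller one. Global (unpartitioned) equality 
deletes are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeleteFile:
    # Hypothetical, simplified model of a delete file's metadata.
    path: str
    partition: str
    sequence_number: int
    is_equality: bool = False


def min_data_sequence_by_partition(data_files):
    """Build the summary: partition -> minimum data file sequence number.

    `data_files` is an iterable of (partition, sequence_number) pairs.
    """
    summary = {}
    for partition, seq in data_files:
        summary[partition] = min(seq, summary.get(partition, seq))
    return summary


def is_delete_still_needed(delete, summary):
    """Conservative check: can the delete file still apply to any data file?"""
    min_seq = summary.get(delete.partition)
    if min_seq is None:
        # No data files remain in the partition, so nothing can match.
        return False
    if delete.is_equality:
        # Assumed rule: equality deletes apply to strictly older data files.
        return min_seq < delete.sequence_number
    # Assumed rule: position deletes also apply at the same sequence number.
    return min_seq <= delete.sequence_number
```

   The check is deliberately conservative: a `True` result only means the 
delete file could still match some data file in the partition, not that it 
actually does. A `False` result means it is safe to drop under the assumptions 
above.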
   
   There is probably another question that we can consider sooner (but after 
this PR). We could pick a file for rewrite even if it is optimal in size when 
scanning it requires applying a lot of delete files. It probably makes sense 
to include such data files in rewrites, as the new file will have a higher 
sequence number, so the existing deletes will no longer apply to it. Again, 
that's something we can add in the future.
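
   A selection rule along these lines could be sketched as follows. This is 
only an illustration in Python, not the API of this PR: `DataFileInfo`, 
`should_rewrite`, `select_for_rewrite`, and the threshold parameters are all 
hypothetical, and the count of applicable delete files is assumed to be known 
from planning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileInfo:
    # Hypothetical, simplified view of a data file during rewrite planning.
    path: str
    size_bytes: int
    delete_file_count: int  # delete files that must be applied on scan


def should_rewrite(data_file, min_size_bytes, max_size_bytes, delete_file_threshold):
    """Pick a file if its size is off target or it carries too many deletes."""
    outside_size_range = (
        data_file.size_bytes < min_size_bytes
        or data_file.size_bytes > max_size_bytes
    )
    too_many_deletes = data_file.delete_file_count >= delete_file_threshold
    return outside_size_range or too_many_deletes


def select_for_rewrite(data_files, min_size_bytes, max_size_bytes, delete_file_threshold):
    """Return the data files that qualify for rewrite, in input order."""
    return [
        f
        for f in data_files
        if should_rewrite(f, min_size_bytes, max_size_bytes, delete_file_threshold)
    ]
```

   With a threshold of, say, 5 delete files, a file inside the target size 
range that needs 8 delete files applied on scan would be selected, while a 
file of the same size with no deletes would be left alone.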

