RussellSpitzer commented on pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845939886
Thanks for discussing it with me. I definitely want to make this efficient, and I know some of our future plans haven't been documented yet. One thing I really want to finish is distributed planning, where no machine ever gets the full scan plan.

> On May 21, 2021, at 12:11 AM, Jack Ye ***@***.***> wrote:
>
> > We talked about this previously as a possible post-merge, post-delete, post-rewrite sort of thing.
>
> Cool, that `cleanUnreferencedDeleteFiles()` was just a divergent thought; great that we already thought about it.
>
> > If file C, for example, is the correct size and we never need to rewrite it, we never clean up those deletes, so we still have to make another sort of action to clean up those files.
>
> Yes, that goes back to what I was thinking before: if we can have an option to force-check the delete file and avoid filtering it out of the rewrite, then it should work.
>
> But I think I am starting to see where you are coming from. If this is done as a different action, then we can save the write time when the file being read does not contain any rows to delete in the delete file. To enable such a check in Spark, it cannot use the same code path that fully reads all the rows and writes them back. So it probably does not make sense to add delete functionality from that perspective. Thanks for the clarification!
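The "file C" scenario above can be illustrated with a small toy model. This is not Iceberg's API; the file names, the `rewrite_undersized` helper, and the target size are all illustrative. It sketches why a size-based compaction alone never cleans up delete files attached to files that are already the right size:

```python
# Toy model of a size-based rewrite (hypothetical, not Iceberg's API).
# Each data file has a size and a set of associated delete files.
# Compaction merges only undersized files; applied deletes are dropped
# for the merged output, but correctly sized files are skipped entirely.

TARGET_SIZE = 128  # MB, illustrative target file size

data_files = {
    "A": {"size": 40, "deletes": {"d1"}},
    "B": {"size": 50, "deletes": {"d1", "d2"}},
    "C": {"size": 128, "deletes": {"d3"}},  # already at target size
}

def rewrite_undersized(files, target):
    """Compact all undersized files into one new file, applying (and
    thereby dropping) their delete files. Files at or above the target
    size are left untouched, deletes and all."""
    small = {k: v for k, v in files.items() if v["size"] < target}
    kept = {k: dict(v) for k, v in files.items() if v["size"] >= target}
    if small:
        merged_size = sum(v["size"] for v in small.values())
        # Deletes for the merged files were applied during the rewrite.
        kept["+".join(sorted(small))] = {"size": merged_size, "deletes": set()}
    return kept

after = rewrite_undersized(data_files, TARGET_SIZE)
print(after["C"]["deletes"])  # file C still carries d3 after the rewrite
```

Since C is never selected for rewriting, `d3` stays referenced indefinitely, which is exactly why the thread lands on needing either a separate cleanup action or an option that forces delete-carrying files into the rewrite.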
