jackye1995 commented on pull request #2591: URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845657683
> We talked about this previously as a possible post-merge, post-delete, post-rewrite sort of thing.

Cool, that `cleanUnreferencedDeleteFiles()` was just a divergent thought; great that we already thought about it.

> If file C for example is the correct size, and we never need to rewrite it, we never clean up those deletes so we still have to make another sort of action to clean up those files.

Yes, that goes back to what I was thinking before: if we can add an option to force-check the delete file and avoid filtering it out of the rewrite, then it should work. But I am starting to see where you are coming from. If this is done as a separate action, we can save the write time whenever the data file being read contains no rows targeted by the delete file. To enable such a check in Spark, we cannot use the same code path that fully reads all the rows and writes them back. So from that perspective, it probably does not make sense to add delete handling to the rewrite action. Thanks for the clarification!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
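The skip-the-rewrite check discussed above could be sketched roughly as follows. This is a toy model under stated assumptions, not Iceberg's actual API: the `appliesTo` method, the map-based representation of positional deletes, and the file paths are all illustrative. The idea is simply that a separate action could test whether a delete file references any rows of a given data file before paying the cost of reading and rewriting it.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: models the "does this delete file touch this
// data file?" check. Not Iceberg's real API or data structures.
public class DeleteFileCheck {

    // Positional deletes modeled as: data file path -> set of deleted row positions.
    // Returns true if the delete file references at least one row in the data file,
    // i.e., a rewrite would actually drop delete entries; false means the rewrite
    // (and its write cost) can be skipped for this file.
    static boolean appliesTo(Map<String, Set<Long>> positionalDeletes, String dataFilePath) {
        Set<Long> positions = positionalDeletes.get(dataFilePath);
        return positions != null && !positions.isEmpty();
    }
}
```

For example, a file like "file C" in the discussion, which no delete entry targets, would return `false` and could be skipped entirely instead of being read and written back.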
