RussellSpitzer edited a comment on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845645797


   > > If we now want to check if we can remove Delete File A we only have to
   > > read files C and D so we actually made progress.
   >
   > I think this is the place I am a bit confused. A' and B' don't need delete
   > file A for sure because sequence number of A' and B' is higher. But we
   > don't read C and D to add delete file A to C and D's FileScanTask. It's
   > done by reading the statistics of delete file A and determined by the
   > partition filter. As long as there are files of lower sequence number in
   > that partition, the delete file will be included to that file scan task.
   >
   > This means that if we can have a counter for each delete file and expose a
   > method `cleanUnreferencedDeleteFiles()` called after `planFileGroups()`, we
   > can naturally get all the files compacted just by running bin packing
   > continuously.
   
   We don't read C and D to decide whether delete file A is attached to C and
   D's scan tasks, but we also can't know whether delete file A is actually
   needed without reading C and D: it may or may not touch any rows in them.
   We can count references, but that breaks down once planning no longer sees
   every file, for example after we push our size filters down to the manifest
   reader itself (future plans). It also only helps in the luckiest case, where
   we know a delete file has been completely dereferenced. If file C, for
   example, is already the correct size and never needs to be rewritten, we
   never clean up those deletes, so we still need another kind of action to
   clean up those files.
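
   To make the distinction concrete, here is a toy Python model (not Iceberg
   code; the `DataFile`/`DeleteFile` classes, the `keys` stand-in for row
   contents, and both helper functions are hypothetical). It assumes the rule
   described above: a delete file is attached to every data file in the same
   partition with a lower sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    partition: str
    seq: int
    keys: tuple  # stand-in for the rows stored in the file


@dataclass(frozen=True)
class DeleteFile:
    name: str
    partition: str
    seq: int
    deleted_keys: frozenset  # stand-in for the rows this file deletes


def referenced_by(delete, data_files):
    """Metadata-only check: same partition and a lower sequence number."""
    return [f for f in data_files
            if f.partition == delete.partition and f.seq < delete.seq]


def actually_needed_by(delete, data_files):
    """Row-level check: has to look at the contents of each data file."""
    return [f for f in referenced_by(delete, data_files)
            if delete.deleted_keys.intersection(f.keys)]


# Delete file A (seq 2) removed keys that lived in the original files A and B.
delete_a = DeleteFile("delete-A", "p0", 2, frozenset({4, 5}))

# After bin packing, A and B became A' and B' (seq 3); C and D are untouched.
live = [
    DataFile("A'", "p0", 3, (1, 2)),
    DataFile("B'", "p0", 3, (3,)),
    DataFile("C", "p0", 1, (10, 11)),
    DataFile("D", "p0", 1, (12,)),
]

ref_count = len(referenced_by(delete_a, live))  # what a counter would see
needed = actually_needed_by(delete_a, live)     # what reading C and D reveals
```

   In this sketch the counter still reports two references (C and D) even
   though the delete file touches no rows in either, and it stays that way for
   as long as C and D are never rewritten.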
   
   That said, the sweep of delete files you are describing should probably be
   performed by every action that deletes files, not just during bin pack,
   since that particular method of eliminating files is very cheap but also
   unlikely to actually pick up any delete files for removal. We talked about
   this previously as a possible post-merge, post-delete, post-rewrite sort of
   thing. But again, since that kind of cleanup never conflicts, we can do it
   at any time without much performance cost.
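
   A rough sketch of such a sweep, again as a toy Python model rather than
   Iceberg code (`sweep_unreferenced` and the dict layout are made up for
   illustration, and the "lower sequence number in the same partition" rule is
   the one assumed earlier in this thread):

```python
def sweep_unreferenced(delete_files, data_files):
    """Split delete files into (kept, dropped) using metadata only."""
    kept, dropped = [], []
    for d in delete_files:
        still_applies = any(
            f["partition"] == d["partition"] and f["seq"] < d["seq"]
            for f in data_files
        )
        (kept if still_applies else dropped).append(d["name"])
    return kept, dropped


deletes = [{"name": "delete-A", "partition": "p0", "seq": 2}]

# Bin pack rewrote A into A' but left C alone, so delete-A is still attached.
after_bin_pack = [
    {"name": "A'", "partition": "p0", "seq": 3},
    {"name": "C", "partition": "p0", "seq": 1},
]

# A later action also rewrote C, so nothing older than delete-A remains.
after_full_rewrite = [
    {"name": "A'", "partition": "p0", "seq": 3},
    {"name": "C'", "partition": "p0", "seq": 4},
]

bin_pack_result = sweep_unreferenced(deletes, after_bin_pack)
full_rewrite_result = sweep_unreferenced(deletes, after_full_rewrite)
```

   The same function could be called after any commit that removes data files;
   it only drops a delete file once no older data file is left in its
   partition.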

