jackye1995 commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-845514501


   I finally got some time to catch up on all the delete work. In general I 
agree that the delete marker sounds like the right way forward. Junjie 
described the following 4 situations for his use cases:
   
   1. Convert all equality deletes to position deletes.
   2. Cluster all position deletes into one.
   3. Convert all equality deletes and position deletes into one set of position deletes.
   4. Remove all deletes.
   
   However, these are based on the assumptions that:
   
   1. we should always move from equality deletes to position deletes to 
data files
   2. we should have as few delete files as possible
   
   Neither is true in every situation. Against 1: if we have tables that are 
well partitioned and sorted, and deletes are issued based on those partition 
and sort columns, then equality deletes can actually consume far less memory 
and also perform better. Against 2: a single delete file has to be included 
in every FileScanTask, and those tasks might be executed by different workers 
that cannot share any cache, whereas if we split the delete files, far fewer 
delete rows have to be read in each task. Splitting also removes the 
bottleneck of reading a single file with high parallelism, which causes 
throttling in cloud storage.
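
   To make the argument against 2 concrete, here is a toy cost model, not 
Iceberg code. It assumes one scan task per partition and that a task reads 
every delete file that is either global or scoped to its partition. The 
`DeleteReadCost` class and its `DeleteFile` type are hypothetical names used 
only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Iceberg's API): counts delete rows read by all tasks.
final class DeleteReadCost {

    // A delete file; partition == null means it applies to every partition.
    static final class DeleteFile {
        final String partition;
        final long rows;

        DeleteFile(String partition, long rows) {
            this.partition = partition;
            this.rows = rows;
        }
    }

    // Sum of delete rows read across tasks, assuming one task per partition.
    static long totalRowsRead(List<String> taskPartitions, List<DeleteFile> deleteFiles) {
        long total = 0;
        for (String partition : taskPartitions) {
            for (DeleteFile file : deleteFiles) {
                // A task reads a delete file if it is global or matches its partition.
                if (file.partition == null || file.partition.equals(partition)) {
                    total += file.rows;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList("a", "b", "c", "d");
        List<DeleteFile> single = Arrays.asList(new DeleteFile(null, 400L));
        List<DeleteFile> split = Arrays.asList(
            new DeleteFile("a", 100L), new DeleteFile("b", 100L),
            new DeleteFile("c", 100L), new DeleteFile("d", 100L));
        System.out.println("single file: " + totalRowsRead(tasks, single)); // 1600
        System.out.println("split files: " + totalRowsRead(tasks, split));  // 400
    }
}
```

   In this model the single file is read in full by all four tasks, while the 
split layout reads each delete row once. Real savings depend on how deletes 
are actually distributed across partitions.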
   
   For major compaction, I think there is no doubt: it is the removal of all 
delete files, and the RewriteDataFiles work that Russell is doing should cover 
the major compaction use case.
   
   But I feel everyone has a similar but slightly different definition of 
minor compaction. I totally agree with Junjie that we should allow 
fine-grained control so people can run a flexible set of actions based on the 
use case, and here is the definition I have in mind:
   
   Major compaction: an action that takes all files in a snapshot and produces 
only data files.
   
   Minor compaction: an action that takes all files in a snapshot and produces 
only delete files that are applied on top of the existing data files.
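
   These two definitions can be read as a rule on what an action outputs. A 
minimal sketch follows; the `CompactionKinds` class, its enums, and the 
`NEITHER` case (for empty or mixed output) are my own assumptions and not 
part of Iceberg.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: classifies an action by the kinds of files it produces.
final class CompactionKinds {

    enum FileKind { DATA, POSITION_DELETE, EQUALITY_DELETE }

    // NEITHER is an assumption covering outputs the two definitions do not name.
    enum Compaction { MAJOR, MINOR, NEITHER }

    static Compaction classify(Set<FileKind> outputKinds) {
        boolean data = outputKinds.contains(FileKind.DATA);
        boolean deletes = outputKinds.contains(FileKind.POSITION_DELETE)
            || outputKinds.contains(FileKind.EQUALITY_DELETE);
        if (data && !deletes) {
            return Compaction.MAJOR; // produces only data files
        }
        if (deletes && !data) {
            return Compaction.MINOR; // produces only delete files
        }
        return Compaction.NEITHER;   // empty or mixed output
    }

    public static void main(String[] args) {
        System.out.println(classify(EnumSet.of(FileKind.DATA)));
        System.out.println(classify(EnumSet.of(FileKind.POSITION_DELETE)));
    }
}
```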
   
   It seems to me that we should add an action similar to `RewriteDataFiles` 
and build another action framework around it, and we can implement different 
strategies for that action to cover the different use cases described. What do 
you think?
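
   As a rough illustration of what I mean by strategies, here is a sketch 
covering use cases 1 and 2 from the list above. Everything in it 
(`DeleteRewriteSketch`, `Strategy`, `FileRef`, `plan`) is a hypothetical name 
and not an existing Iceberg API; it only shows the planning step, that is, 
which files a strategy would pick up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Iceberg.
final class DeleteRewriteSketch {

    enum FileKind { DATA, POSITION_DELETE, EQUALITY_DELETE }

    // Minimal stand-in for a file tracked by a snapshot.
    static final class FileRef {
        final String path;
        final FileKind kind;

        FileRef(String path, FileKind kind) {
            this.path = path;
            this.kind = kind;
        }
    }

    // A strategy decides which files the action rewrites and what kind it emits.
    interface Strategy {
        List<FileRef> select(List<FileRef> snapshotFiles);
        FileKind outputKind();
    }

    // Shared helper: keep only the files of one kind, preserving order.
    static List<FileRef> ofKind(List<FileRef> files, FileKind kind) {
        List<FileRef> matched = new ArrayList<>();
        for (FileRef file : files) {
            if (file.kind == kind) {
                matched.add(file);
            }
        }
        return matched;
    }

    // Use case 1: convert all equality deletes to position deletes.
    static final class EqualityToPosition implements Strategy {
        public List<FileRef> select(List<FileRef> files) {
            return ofKind(files, FileKind.EQUALITY_DELETE);
        }
        public FileKind outputKind() { return FileKind.POSITION_DELETE; }
    }

    // Use case 2: cluster all position deletes together.
    static final class ClusterPositionDeletes implements Strategy {
        public List<FileRef> select(List<FileRef> files) {
            return ofKind(files, FileKind.POSITION_DELETE);
        }
        public FileKind outputKind() { return FileKind.POSITION_DELETE; }
    }

    // Planning step of the action: list the paths a strategy would rewrite.
    static List<String> plan(Strategy strategy, List<FileRef> snapshotFiles) {
        List<String> paths = new ArrayList<>();
        for (FileRef file : strategy.select(snapshotFiles)) {
            paths.add(file.path);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<FileRef> files = Arrays.asList(
            new FileRef("data-1.parquet", FileKind.DATA),
            new FileRef("eq-1.parquet", FileKind.EQUALITY_DELETE),
            new FileRef("pos-1.parquet", FileKind.POSITION_DELETE),
            new FileRef("pos-2.parquet", FileKind.POSITION_DELETE));
        System.out.println(plan(new EqualityToPosition(), files));
        System.out.println(plan(new ClusterPositionDeletes(), files));
    }
}
```

   Both strategies here emit only delete files, so under the definitions 
above they would count as minor compaction.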

