chenjunjiedada commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-811131816


   @rdblue @openinx, I think the goal here is to provide more fine-grained
compaction actions. Let me give some more background.
   
   We have many internal Flink jobs that consume tens of billions of messages
from the MQ system and sink them to Iceberg every day. Because users want to
see data as soon as possible, they usually set the checkpoint interval to a
minute or less. As a result, the jobs produce a huge number of small files on
HDFS. To optimize read performance, we have to compact or cluster the small
files, but compaction and clustering themselves need resources and add
overhead for the cluster. To reduce the load on the name node and save
resources for users, we split the compaction action into fine-grained actions
that take a predicate and group files by partition.
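
   A minimal sketch of what "predicate plus group by partition" planning could
look like is below. It is only an illustration: `DataFile`, `plan_compaction`,
and the 32 MB threshold are hypothetical names and values, not the Iceberg API
or the code in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry; not Iceberg's DataFile.
    path: str
    partition: str
    size_bytes: int


def plan_compaction(
    files: Iterable[DataFile],
    predicate: Callable[[DataFile], bool],
    small_file_bytes: int = 32 * 1024 * 1024,  # assumed threshold
) -> Dict[str, List[DataFile]]:
    """Select small files that match the predicate, grouped by partition."""
    groups: Dict[str, List[DataFile]] = defaultdict(list)
    for f in files:
        if predicate(f) and f.size_bytes < small_file_bytes:
            groups[f.partition].append(f)
    # A partition with a single small file gains nothing from a rewrite.
    return {p: fs for p, fs in groups.items() if len(fs) > 1}
```

   Each returned group could then be rewritten as an independent task, so a
user can limit one run to, say, a single day's partitions.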
   
   As we are going to support consuming CDC streaming data, I expect there
will be a lot of equality delete and position delete files. So we need more
fine-grained actions to optimize the read path, as we did for data file
compaction. Specifically, we have four kinds of compaction for deletes.
   
   1. Convert all equality deletes to position deletes.
   2. Cluster all position deletes into one file.
   3. Convert all equality deletes and position deletes into one position delete file.
   4. Remove all deletes.
   
   From my understanding, the first three are minor compactions, and the last
is a major one. The first and second need only a small amount of compute and
IO resources, and running the first and then the second achieves almost the
same optimization effect as the third. Of course, we could eventually
implement the third as well. The point is that we want to provide fine-grained
options so that users can apply strategies according to the state of their
cluster.
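
   To make the first two kinds concrete, here is a rough in-memory sketch. It
assumes a simplified model: rows are dicts, a position delete is a
`(data file path, row position)` pair, and an equality delete is a dict of
column values. Sequence numbers, which decide which data files an equality
delete applies to, are deliberately ignored. The function names are
hypothetical and not part of Iceberg.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]
PositionDelete = Tuple[str, int]  # (data file path, row position)


def equality_to_position(
    data_files: Dict[str, List[Row]],
    equality_deletes: Iterable[Row],
) -> Set[PositionDelete]:
    """Kind 1: resolve equality deletes to the row positions they match.

    Simplification: every equality delete is applied to every data file.
    """
    deletes = list(equality_deletes)
    result: Set[PositionDelete] = set()
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            # A row is deleted if all columns of any equality delete match.
            if any(all(row.get(c) == v for c, v in d.items()) for d in deletes):
                result.add((path, pos))
    return result


def merge_position_deletes(
    *delete_files: Iterable[PositionDelete],
) -> List[PositionDelete]:
    """Kind 2: merge position delete files into one sorted, deduplicated list."""
    merged: Set[PositionDelete] = set()
    for delete_file in delete_files:
        merged.update(delete_file)
    return sorted(merged)
```

   Passing the output of `equality_to_position` together with the existing
position deletes into `merge_position_deletes` yields a single position delete
set, which is the result the third kind would produce in one pass.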

