chenjunjiedada commented on pull request #2372: URL: https://github.com/apache/iceberg/pull/2372#issuecomment-811131816
@rdblue @openinx, I think the goal here is to provide more fine-grained compaction actions. Let me give some background.

We have many internal Flink jobs that consume tens of billions of messages from the MQ system and sink them to Iceberg every day. Because users want to see data as soon as possible, they usually set the checkpoint interval to a minute or less. As a result, the jobs produce a huge number of small files on HDFS. To optimize read performance, we have to compact or cluster the small files, but compaction and clustering themselves need resources and add overhead to the cluster. To reduce the load on the NameNode and save resources for users, we split the compaction action into finer-grained actions that take a predicate and group by partition.

As we are going to support consuming CDC streaming data, I expect there will be a lot of equality delete and position delete files. So we need more fine-grained actions to optimize the read path, like what we did for data file compaction. We actually have four kinds of compaction for deletes:

1. Convert all equality deletes to position deletes.
2. Cluster all position deletes into one.
3. Convert all equality deletes and position deletes into one position delete.
4. Remove all deletes.

From my understanding, the first three are minor compactions, and the last is a major one. The first and second need only a small amount of compute and IO, and running the first and then the second achieves almost the same optimization effect as the third. Of course, we could eventually implement the third as well. The point is that we want to provide fine-grained options so that users can apply strategies according to the cluster's situation.
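To make the four kinds of delete compaction concrete, here is a minimal toy sketch in Python. It is not the code in this PR and uses no Iceberg API: `Table` and every function name below are hypothetical, the table lives in memory, and sequence numbers and partitioning are ignored (an equality delete is assumed to apply to every data file).

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy in-memory table (hypothetical, not an Iceberg class)."""
    data_files: dict                                       # file name -> list of row dicts
    equality_deletes: list = field(default_factory=list)   # each file: {column: value}
    position_deletes: list = field(default_factory=list)   # each file: set of (file name, position)


def _eq_deleted(row, equality_deletes):
    # A row is deleted if it matches every column of some equality delete.
    return any(all(row.get(c) == v for c, v in eq.items()) for eq in equality_deletes)


def _deleted_positions(t):
    # Every (file, position) removed by any delete file, found in one scan.
    deleted = set().union(*t.position_deletes)
    for name, rows in t.data_files.items():
        for pos, row in enumerate(rows):
            if _eq_deleted(row, t.equality_deletes):
                deleted.add((name, pos))
    return deleted


def live_rows(t):
    # What a reader sees after applying all deletes.
    deleted = _deleted_positions(t)
    return [row for name in sorted(t.data_files)
            for pos, row in enumerate(t.data_files[name])
            if (name, pos) not in deleted]


def convert_equality_to_position(t):
    # Kind 1: replace equality deletes with one new position delete file.
    matched = {(name, pos) for name, rows in t.data_files.items()
               for pos, row in enumerate(rows)
               if _eq_deleted(row, t.equality_deletes)}
    extra = [matched] if matched else []
    return Table(t.data_files, [], t.position_deletes + extra)


def cluster_position_deletes(t):
    # Kind 2: merge all position delete files into one.
    merged = set().union(*t.position_deletes)
    return Table(t.data_files, t.equality_deletes, [merged] if merged else [])


def merge_all_deletes(t):
    # Kind 3: equality and position deletes become one position delete file.
    deleted = _deleted_positions(t)
    return Table(t.data_files, [], [deleted] if deleted else [])


def remove_all_deletes(t):
    # Kind 4 (major): rewrite data files without deleted rows, drop all deletes.
    deleted = _deleted_positions(t)
    rewritten = {}
    for name, rows in t.data_files.items():
        kept = [row for pos, row in enumerate(rows) if (name, pos) not in deleted]
        if kept:
            rewritten[name] = kept
    return Table(rewritten, [], [])
```

In this toy model, `cluster_position_deletes(convert_equality_to_position(t))` leaves the table in the same state as `merge_all_deletes(t)`, and only `remove_all_deletes` rewrites data files.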
