zhuanshenbsj1 commented on PR #9039: URL: https://github.com/apache/hudi/pull/9039#issuecomment-1605364258
> > 3\. getOldestInstantToRetainForCompaction also needs to check earliestInstantToRetain like getOldestInstantToRetainForClustering, to ensure that the files related to the compaction instant have been cleaned up before archiving.
>
> This is not a bug, and changes around this is unnecessary. The reason is that compaction is different from clustering in the sense that compaction does not add or delete any file group, while clustering generates a replacecommit that replaces existing file groups with new ones, so cleaning has to delete old file groups based on the information from the replacecommit in the active timeline. Even if the compaction commit is archived, the cleaning still behaves properly, as the old file slices that the compaction operation touches can still be identified and deleted.

If the table is partitioned by time and the archiving and cleaning strategies are aggressive, consider a timeline like this:

dc1, dc2, compaction1.inflight, dc3, dc4, compaction2.inflight, dc5, dc6

- dc1, dc2, and compaction1.inflight belong to partition1
- dc3, dc4, and compaction2.inflight belong to partition2
- dc5 and dc6 belong to partition3

Suppose cleaning has reached dc2. At a later point in time, compaction1 and compaction2 have just completed, and the archiving operation archives every instant before dc5. At that point, an incremental clean cannot clean up partition1 and partition2.
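The scenario can be sketched with a toy model. This is not Hudi's API: `IncrementalCleanSketch`, `Instant`, `archiveBefore`, and `partitionsToClean` are hypothetical names, and the model assumes that an incremental clean only discovers partitions from instants that are still on the active timeline after the last cleaned instant.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IncrementalCleanSketch {

    /** Toy stand-in for a timeline instant: a name and the partition it touched. */
    public static final class Instant {
        public final String name;
        public final String partition;

        public Instant(String name, String partition) {
            this.name = name;
            this.partition = partition;
        }
    }

    /** The timeline from the comment, after compaction1 and compaction2 have completed. */
    public static List<Instant> exampleTimeline() {
        List<Instant> timeline = new ArrayList<>();
        timeline.add(new Instant("dc1", "partition1"));
        timeline.add(new Instant("dc2", "partition1"));
        timeline.add(new Instant("compaction1", "partition1"));
        timeline.add(new Instant("dc3", "partition2"));
        timeline.add(new Instant("dc4", "partition2"));
        timeline.add(new Instant("compaction2", "partition2"));
        timeline.add(new Instant("dc5", "partition3"));
        timeline.add(new Instant("dc6", "partition3"));
        return timeline;
    }

    /** Hypothetical archiver: keeps only the instants from firstRetained onward. */
    public static List<Instant> archiveBefore(List<Instant> timeline, String firstRetained) {
        List<Instant> active = new ArrayList<>();
        boolean keep = false;
        for (Instant instant : timeline) {
            if (instant.name.equals(firstRetained)) {
                keep = true;
            }
            if (keep) {
                active.add(instant);
            }
        }
        return active;
    }

    /**
     * Hypothetical incremental clean planner: collects the partitions touched by
     * instants that follow lastCleaned on the ACTIVE timeline. If lastCleaned has
     * been archived, every remaining active instant is treated as newer.
     */
    public static Set<String> partitionsToClean(List<Instant> active, String lastCleaned) {
        int start = 0;
        for (int i = 0; i < active.size(); i++) {
            if (active.get(i).name.equals(lastCleaned)) {
                start = i + 1;
            }
        }
        Set<String> partitions = new LinkedHashSet<>();
        for (int i = start; i < active.size(); i++) {
            partitions.add(active.get(i).partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Instant> timeline = exampleTimeline();

        // Without archiving, the clean after dc2 still sees both compactions.
        System.out.println(partitionsToClean(timeline, "dc2"));

        // After archiving everything before dc5, only dc5 and dc6 remain visible.
        System.out.println(partitionsToClean(archiveBefore(timeline, "dc5"), "dc2"));
    }
}
```

In this model the first call reports all three partitions, while the second reports only partition3, so the file slices superseded by compaction1 and compaction2 are never revisited by the incremental clean.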
