zhuanshenbsj1 commented on PR #9039:
URL: https://github.com/apache/hudi/pull/9039#issuecomment-1605364258

   > > 3\. getOldestInstantToRetainForCompaction also needs to check 
earliestInstantToRetain like getOldestInstantToRetainForClustering, to ensure 
that the files related to the compaction instant have been cleaned up before 
archiving.
   > 
   > This is not a bug, and changes around this is unnecessary. The reason is 
that compaction is different from clustering in the sense that compaction does 
not add or delete any file group, while clustering generates a replacecommit 
that replaces existing file groups with new ones, so cleaning has to delete old 
file groups based on the information from the replacecommit in the active 
timeline. Even if the compaction commit is archived, the cleaning still behaves 
properly, as the old file slices that the compaction operation touches can 
still be identified and deleted.
   
   Consider a table that is partitioned by time, with aggressive archiving and 
cleaning strategies, and a timeline like:
   
   dc1, dc2, compaction1.inflight, dc3, dc4, compaction2.inflight, dc5, dc6
   
   dc1, dc2, compaction1.inflight belong to partition1
   dc3, dc4, compaction2.inflight belong to partition2
   dc5, dc6 belong to partition3
   
   Suppose cleaning has reached dc2. At a later point in time, compaction1 and 
compaction2 complete, and the archiving operation then archives every instant 
before dc5. From then on, an incremental clean cannot clean up partition1 and 
partition2, since the instants that touched those partitions are no longer in 
the active timeline.
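
The scenario above can be sketched with a toy model. This is not Hudi code: `Instant`, `archive_before` and `incremental_clean_partitions` are hypothetical names, and the sketch assumes that an incremental clean only considers partitions touched by instants still present in the active timeline after the last cleaned instant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    name: str
    partition: str


# Timeline from the example: each instant touches exactly one partition.
TIMELINE = [
    Instant("dc1", "partition1"),
    Instant("dc2", "partition1"),
    Instant("compaction1", "partition1"),
    Instant("dc3", "partition2"),
    Instant("dc4", "partition2"),
    Instant("compaction2", "partition2"),
    Instant("dc5", "partition3"),
    Instant("dc6", "partition3"),
]


def archive_before(timeline, first_kept):
    """Return the active timeline after archiving everything before first_kept."""
    idx = next(i for i, inst in enumerate(timeline) if inst.name == first_kept)
    return timeline[idx:]


def incremental_clean_partitions(active_timeline, last_cleaned):
    """Partitions an incremental clean would visit (toy assumption).

    Only instants that are still active and come after last_cleaned are
    inspected; if last_cleaned was archived, the whole active timeline is used.
    """
    names = [inst.name for inst in active_timeline]
    start = names.index(last_cleaned) + 1 if last_cleaned in names else 0
    return {inst.partition for inst in active_timeline[start:]}


# Cleaning has reached dc2; nothing archived yet.
before = incremental_clean_partitions(TIMELINE, "dc2")

# Both compactions complete, then everything before dc5 is archived.
active = archive_before(TIMELINE, "dc5")
after = incremental_clean_partitions(active, "dc2")

print(sorted(before))
print(sorted(after))
```

Under these assumptions, partition1 and partition2 are visible to the clean before archiving and drop out of its view afterwards, which is the gap described above.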
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
