danny0405 commented on code in PR #11440:
URL: https://github.com/apache/hudi/pull/11440#discussion_r1674893075


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##########
@@ -264,20 +268,46 @@ private List<HoodieInstant> getCommitInstantsToArchive() throws IOException {
         .map(Option::get)
         .min(HoodieInstant.COMPARATOR);
 
-    // Step2: We cannot archive any commits which are made after the first savepoint present,
+    // Step2: We cannot archive any commits which are later than EarliestCommitToNotArchive,
     // unless HoodieArchivalConfig#ARCHIVE_BEYOND_SAVEPOINT is enabled.
-    Option<HoodieInstant> firstSavepoint = table.getCompletedSavepointTimeline().firstInstant();
+
+    // EarliestCommitToNotArchive is required to block archival when a savepoint is deleted.
+    // This ensures that archival stays blocked until the cleaner has cleaned up the files retained due to the savepoint.
+    // Since EarliestCommitToNotArchive is advanced by the cleaner, it is the way the cleaner notifies archival
+    // that cleanup has finished and archival can advance further.
+    // If this guard is not present, archiving commits can lead to duplicates. Here is a scenario
+    // illustrating this, assuming the EarliestCommitToNotArchive guard is absent:
+    // c1.dc - f1 (deltacommit c1 creates file with id f1)
+    // c2.dc - f2 (deltacommit c2 creates file with id f2)
+    // c2.sp - Savepoint at c2
+    // c3.rc (replacing f2 -> f3) (Replace commit replacing file id f2 with f3)
+    // c4.dc
+    //
+    // Let's say the incremental cleaner moved past c3.rc without cleaning f2, since a savepoint exists at c2.
+    // Archival is blocked at c2 since there is a savepoint at c2.
+    // Now let's say the savepoint at c2 is deleted; archival would archive c3.rc since it is unblocked.
+    // Since c3 is archived but f2 has not been cleaned, the table view would still consider f2 a valid
+    // file id. This causes duplicates.
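The guard described in the comments above can be simulated with a minimal, self-contained sketch. Everything here is illustrative and not Hudi's actual API: `commitsToArchive` is a hypothetical helper that mimics how archival stops at the earliest commit the cleaner has not yet confirmed cleanup for.

```java
import java.util.ArrayList;
import java.util.List;

public class ArchivalGuardSketch {

  // Hypothetical simplification: return the commits eligible for archival,
  // stopping at the guard instant (earliestCommitToNotArchive) if present.
  // Commit times are lexicographically ordered strings ("c1" < "c2" < ...).
  static List<String> commitsToArchive(List<String> completedCommits,
                                       String earliestCommitToNotArchive) {
    List<String> archivable = new ArrayList<>();
    for (String commit : completedCommits) {
      if (earliestCommitToNotArchive != null
          && commit.compareTo(earliestCommitToNotArchive) >= 0) {
        // The cleaner has not confirmed cleanup beyond this point,
        // so archival must not move past it.
        break;
      }
      archivable.add(commit);
    }
    return archivable;
  }

  public static void main(String[] args) {
    List<String> commits = List.of("c1", "c2", "c3", "c4");

    // Savepoint at c2 was deleted, but the cleaner has not yet removed f2
    // (the file group replaced by c3.rc), so it still pins c3 as the guard:
    // c3.rc is NOT archived and f2 cannot leak back as a valid file id.
    System.out.println(commitsToArchive(commits, "c3")); // [c1, c2]

    // Once the cleaner removes f2, it advances the guard and archival
    // is free to move past c3.
    System.out.println(commitsToArchive(commits, null)); // [c1, c2, c3, c4]
  }
}
```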

Review Comment:
   Can we put the logic inside `getEarliestInstantToRetainForClustering`, to keep the logic around replace_commit more concentrated and avoid going through the replace_commit timeline multiple times?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
