voonhous commented on code in PR #9243:
URL: https://github.com/apache/hudi/pull/9243#discussion_r1270291077


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java:
##########
@@ -683,4 +694,56 @@ public void testKeepXHoursWithCleaning(
     assertFalse(testTable.baseFileExists(p0, firstCommitTs, file1P0C0));
     assertFalse(testTable.baseFileExists(p1, firstCommitTs, file1P1C0));
   }
+
+  @Test
+  public void testGetEarliestCommitToRetain() {
+    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
+            .withPath(basePath)
+            .withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA)
+            .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+                    .withAssumeDatePartitioning(true)
+                    .build())
+            .withAutoCommit(false)
+            .withCleanConfig(HoodieCleanConfig.newBuilder()
+                    .withIncrementalCleaningMode(true)
+                    
.withFailedWritesCleaningPolicy(HoodieFailedWritesCleaningPolicy.LAZY)
+                    
.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+                    .retainCommits(5)
+                    .build())
+            .build();
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(config);
+    IntStream.rangeClosed(1, 9).mapToObj(i -> {
+      String newCommitTime = "00" + i;
+      List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 10);
+      JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);
+      writeClient.startCommitWithTime(newCommitTime);
+      JavaRDD<WriteStatus> writeStatues = writeClient.insert(writeRecords, 
newCommitTime);
+      // Assuming the first commit is pending, simulating the situation where 
all instants before the first pending commit have been achieved.

Review Comment:
   > In this case, I think it is reasonable for the archive to remove all 
commits before this pending commit
   
   I don't think this is reasonable. 
   
   ```java
     /**
      * A FileSlice is considered committed, if one of the following is true - 
There is a committed data file - There are
      * some log files, that are based off a commit or delta commit.
      */
     private boolean isFileSliceCommitted(FileSlice slice) {
       if (!compareTimestamps(slice.getBaseInstantTime(), 
LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
         return false;
       }
   
       return 
timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
     }
   ```
   
   Given your above assumption, a `fileSlice` that may be the output of a 
inflight replacecommit/clustering will be flagged as committed, when in 
actuality, it is not... 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to