dongkelun commented on a change in pull request #4016:
URL: https://github.com/apache/hudi/pull/4016#discussion_r774268460



##########
File path: 
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
##########
@@ -750,6 +750,59 @@ public void testKeepLatestFileVersions(Boolean 
enableBootstrapSourceClean) throw
     assertTrue(testTable.baseFileExists(p0, "00000000000003", file3P0C2));
   }
 
+  @Test
+  public void testCleanEmptyInstants() throws Exception {
+    HoodieWriteConfig config =
+            HoodieWriteConfig.newBuilder()
+                    .withPath(basePath)
+                    
.withMetadataConfig(HoodieMetadataConfig.newBuilder().withAssumeDatePartitioning(true).build())
+                    .withCompactionConfig(HoodieCompactionConfig.newBuilder()
+                            
.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS).retainFileVersions(1).build())
+                    .build();
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+
+    int commitCount = 20;
+    int cleanCount = 20;
+
+    int startInstant = 1;
+    for (int i = 0; i < commitCount; i++, startInstant++) {
+      String commitTime = makeNewCommitTime(startInstant);
+      HoodieTestTable.of(metaClient).addCommit(commitTime);
+    }
+
+    for (int i = 0; i < cleanCount; i++, startInstant++) {
+      String commitTime = makeNewCommitTime(startInstant);
+      createCleanMetadata(commitTime + "", false, true);
+    }
+
+    List<HoodieCleanStat> cleanStats = runCleaner(config);
+    HoodieActiveTimeline timeline = metaClient.reloadActiveTimeline();
+
+    assertEquals(0, cleanStats.size(), "Must not clean any files");
+    assertEquals(1, timeline.getTimelineOfActions(

Review comment:
       First, run the cleaner several times, and we will delete all empty 
instants,But if run it once, only one will be deleted.The process is probably 
like this:
   `HoodieCleanMetadata clean` -> `scheduleTableServiceInternal` -> 
`scheduleCleaning` -> `requestClean` -> `getPartitionPathsToClean` -> 
`getPartitionPathsForCleanByCommits`
   
   `Option<HoodieInstant> lastClean = 
hoodieTable.getCleanTimeline().filterCompletedInstants().lastInstant();`
   Only the last completed instant will be obtained here,
   So only the last .clean will be deleted here([000000040__clean__COMPLETED]), 
leaving the last one .inflight and .requested.
   
   Then:
   `createTable(config, hadoopConf).clean(context, cleanInstantTime, 
skipLocking)` ->  `HoodieCleanMetadata execute()`
   
   `List<HoodieInstant> pendingCleanInstants = table.getCleanTimeline()
           
.filterInflightsAndRequested().getInstants().collect(Collectors.toList());`
   Because there is only the last one that matches ` 
filterinflightsandrequested`,So only the last .inflight will be 
deleted,([==>000000040__clean__INFLIGHT]) leaving the last one  
.requested([==>000000040__clean__REQUESTED]).
   
   Execute the second time, the same logic:
   First: delete 000000039__clean__COMPLETED ,  leave :  
000000039__clean__INFLIGHT, 000000039__clean__REQUESTED
   Then filterInflightsAndRequested(): delete 000000040__clean__REQUESTED and 
000000039__clean__INFLIGHT
   
   Then, if we execute this multiple times, we can delete all.
   
   
   the assertions:
    
    `assertEquals(1, timeline.getTimelineOfActions(
               
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().countInstants());`
     leave one .requested
     
   `assertEquals(0, timeline.getTimelineOfActions(
               
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflights().countInstants());`
 
   not have .inflight 
   
   `assertEquals(--cleanCount, timeline.getTimelineOfActions(
               
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterCompletedInstants().countInstants());`
      
   
   Because delete one .clean at a time, so the total is reduced by one  
   
   `
       assertTrue(timeline.getTimelineOfActions(
               
CollectionUtils.createSet(HoodieTimeline.CLEAN_ACTION)).filterInflightsAndRequested().containsInstant(makeNewCommitTime(--startInstant)));
   `
   The rest is `--startInstant__clean__REQUESTED`
   
   
   
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to