[GitHub] [hudi] majian1998 commented on a diff in pull request #9243: [HUDI-6574] Fix the problem that incremental clean cannot be executed when the earliest ActiveTimeline is a pending commit.

via GitHub Thu, 20 Jul 2023 21:51:03 -0700


majian1998 commented on code in PR #9243:
URL: https://github.com/apache/hudi/pull/9243#discussion_r1270221553



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java:
##########
@@ -683,4 +694,56 @@ public void testKeepXHoursWithCleaning(
     assertFalse(testTable.baseFileExists(p0, firstCommitTs, file1P0C0));
     assertFalse(testTable.baseFileExists(p1, firstCommitTs, file1P1C0));
   }
+
+  @Test
+  public void testGetEarliestCommitToRetain() {
+    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
+            .withPath(basePath)
+            .withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA)
+            .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+                    .withAssumeDatePartitioning(true)
+                    .build())
+            .withAutoCommit(false)
+            .withCleanConfig(HoodieCleanConfig.newBuilder()
+                    .withIncrementalCleaningMode(true)
+                    .withFailedWritesCleaningPolicy(HoodieFailedWritesCleaningPolicy.LAZY)
+                    .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+                    .retainCommits(5)
+                    .build())
+            .build();
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(config);
+    IntStream.rangeClosed(1, 9).mapToObj(i -> {
+      String newCommitTime = "00" + i;
+      List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 10);
+      JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);
+      writeClient.startCommitWithTime(newCommitTime);
+      JavaRDD<WriteStatus> writeStatues = writeClient.insert(writeRecords, newCommitTime);
+      // Assuming the first commit is pending, simulating the situation where all instants before the first pending commit have been archived.

Review Comment:
   Your understanding of the premise is correct. However, a commit can stay
   pending for a long time, for example because of delayed heartbeats or a
   long-running replace commit, and many later commits then accumulate on the
   timeline. In that case, I think it is reasonable for the archive to remove
   all commits before this pending commit. This leads to the problem I
   mentioned: when an incremental clean runs, the endpoint recorded by the last
   clean can no longer be found to serve as the starting point of the current
   clean, which results in a full clean. If, for this case, we use the pending
   commit as the endpoint in getPartitionPathsForIncrementalCleaning, we can
   select the instants on the timeline that are greater than or equal to the
   starting point and less than the endpoint, which means the pending commit
   itself will not be removed. This keeps the original intention, leaves other
   situations unchanged, and solves the problem. I'm not sure I've expressed my
   understanding clearly. Does what I said make sense to you?
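
The half-open window described in the comment (greater than or equal to the starting point, less than the pending commit) can be illustrated with a small, self-contained sketch. This is not Hudi's implementation of getPartitionPathsForIncrementalCleaning: the `Instant` class, the `instantsInWindow` method, and the sample timestamps are hypothetical stand-ins, and the sketch assumes instant times are equal-length strings that sort lexicographically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalCleanWindowSketch {

  // Hypothetical stand-in for a timeline instant; not Hudi's HoodieInstant.
  static final class Instant {
    final String timestamp;
    final boolean completed;

    Instant(String timestamp, boolean completed) {
      this.timestamp = timestamp;
      this.completed = completed;
    }
  }

  // Returns the timestamps of completed instants in [start, end), where end is
  // the earliest pending instant on the timeline. If nothing is pending, there
  // is no upper bound.
  static List<String> instantsInWindow(List<Instant> timeline, String start) {
    String end = null;
    for (Instant i : timeline) {
      if (!i.completed && (end == null || i.timestamp.compareTo(end) < 0)) {
        end = i.timestamp;
      }
    }
    List<String> result = new ArrayList<>();
    for (Instant i : timeline) {
      boolean atOrAfterStart = i.timestamp.compareTo(start) >= 0;
      boolean beforeEnd = end == null || i.timestamp.compareTo(end) < 0;
      if (i.completed && atOrAfterStart && beforeEnd) {
        result.add(i.timestamp);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Instant> timeline = Arrays.asList(
        new Instant("001", true),
        new Instant("002", true),
        new Instant("003", true),
        new Instant("004", false), // pending commit, used as the endpoint
        new Instant("005", true));
    // Starting point "002" stands for the boundary recorded by the last clean.
    System.out.println(instantsInWindow(timeline, "002")); // prints [002, 003]
  }
}
```

In this sketch the pending instant "004" is never part of the result, which mirrors the point that the pending commit is excluded from the range being cleaned.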
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
