majian1998 commented on code in PR #9243:
URL: https://github.com/apache/hudi/pull/9243#discussion_r1270221553
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java:
##########
@@ -683,4 +694,56 @@ public void testKeepXHoursWithCleaning(
assertFalse(testTable.baseFileExists(p0, firstCommitTs, file1P0C0));
assertFalse(testTable.baseFileExists(p1, firstCommitTs, file1P1C0));
}
+
+ @Test
+ public void testGetEarliestCommitToRetain() {
+ HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
+ .withPath(basePath)
+ .withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA)
+ .withMetadataConfig(HoodieMetadataConfig.newBuilder()
+ .withAssumeDatePartitioning(true)
+ .build())
+ .withAutoCommit(false)
+ .withCleanConfig(HoodieCleanConfig.newBuilder()
+ .withIncrementalCleaningMode(true)
+
.withFailedWritesCleaningPolicy(HoodieFailedWritesCleaningPolicy.LAZY)
+
.withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
+ .retainCommits(5)
+ .build())
+ .build();
+ SparkRDDWriteClient writeClient = getHoodieWriteClient(config);
+ IntStream.rangeClosed(1, 9).mapToObj(i -> {
+ String newCommitTime = "00" + i;
+ List<HoodieRecord> records = dataGen.generateInserts(newCommitTime, 10);
+ JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);
+ writeClient.startCommitWithTime(newCommitTime);
+ JavaRDD<WriteStatus> writeStatues = writeClient.insert(writeRecords,
newCommitTime);
+ // Assuming the first commit is pending, simulating the situation where
all instants before the first pending commit have been achieved.
Review Comment:
Your understanding of the premise is correct, but a pending commit may be
caused by delayed heartbeats or the execution of a replace commit that takes a
long time, resulting in many commits on the timeline. In this case, I think it
is reasonable for the archive to remove all commits before this pending commit.
However, this can lead to the problem I mentioned, that is, when performing
incremental clean, the endpoint of the last clean record cannot be found as the
starting point of this clean, which will result in a full clean. If we mark the
pending commit record as the endpoint of this case in
getPartitionPathsForIncrementalCleaning, we can find a timeline greater than or
equal to the starting point and less than the endpoint, which means that the
pending commit will not be removed. This operation maintains the original
intention without changing other situations and solves this problem. I'm not
sure if I've expressed my understanding clearly, do you think what I said makes
sense
?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]