fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445692494


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();
+      if (earliestPendingCommit.isPresent()
+          && (!earliestCommitToRetain.isPresent()
+          || compareTimestamps(earliestPendingCommit.get().requestedTime(), 
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+        earliestCommitToRetain = 
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();

Review Comment:
   > If the earliest pending commit precedes all completed commits, 
`findInstantsBefore(...).lastInstant()` returns empty and the by-hours ECTR 
collapses to `Option.empty()`. Please confirm an empty ECTR means "retain 
everything" (not "clean everything") downstream, and add a test for the 
pending-older-than-all-completed case — the new test only covers a completed 
commit existing before the pending one.
   
   Yes, `findInstantsBefore(...).lastInstant()` can return empty in that case. 
Downstream, an empty ECTR means no partition scan/no cleaning for commit/hour 
based cleaning: `CleanPlanner#getPartitionPathsForCleanByCommits`
     returns an empty partition list when the retained instant is absent.
   
     I added a test for the pending-before-all-completed case. I also tightened 
the guard so the pending override only applies when the initial by-hours ECTR 
is present. This avoids turning a "no completed instant after
     cutoff" case into a synthetic ECTR based on the pending instant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to