fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3458481471


##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();
+      if (earliestPendingCommit.isPresent()
+          && (!earliestCommitToRetain.isPresent()
+          || compareTimestamps(earliestPendingCommit.get().requestedTime(), 
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+        earliestCommitToRetain = 
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();

Review Comment:
   > > Yes, findInstantsBefore(...).lastInstant() can return empty in that 
case. Downstream, an empty ECTR means no partition scan/no cleaning for 
commit/hour based cleanin
   > 
   > Could you elaborate why the cleaing be blocked by a pending instant here? 
since we have completion time file slicing(pending files always belong to the 
latest one), it is okey to forward the cleaning, and the ECTR should be based 
on completion time too?
   
   Thanks, this is a good point. I will re-check this against the current 
completion-time file slicing behavior. If pending files are always attached to 
the latest slice and the cleaner already preserves that slice, the by-
     hours pending guard may be overly conservative.
   
     I’ll verify this with a clean-plan level test instead of only testing 
`CleanerUtils`. I’ll also check whether the by-hours ECTR should be derived 
from completion time rather than requested time. If the existing file-
     slice protection is sufficient, I’ll remove the pending guard change and 
update the PR scope accordingly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to