fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3458481471
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+ .findFirst());
+
+ Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s
-> !s.isCompleted()).firstInstant();
+ if (earliestPendingCommit.isPresent()
+ && (!earliestCommitToRetain.isPresent()
+ || compareTimestamps(earliestPendingCommit.get().requestedTime(),
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+ earliestCommitToRetain =
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();
Review Comment:
> > Yes, findInstantsBefore(...).lastInstant() can return empty in that
case. Downstream, an empty ECTR means no partition scan/no cleaning for
commit/hour based cleanin
>
> Could you elaborate why the cleaing be blocked by a pending instant here?
since we have completion time file slicing(pending files always belong to the
latest one), it is okey to forward the cleaning, and the ECTR should be based
on completion time too?
Thanks, this is a good point. I will re-check this against the current
completion-time file slicing behavior. If pending files are always attached to
the latest slice and the cleaner already preserves that slice, the by-
hours pending guard may be overly conservative.
I’ll verify this with a clean-plan level test instead of only testing
`CleanerUtils`. I’ll also check whether the by-hours ECTR should be derived
from completion time rather than requested time. If the existing file-
slice protection is sufficient, I’ll remove the pending guard change and
update the PR scope accordingly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]