fhan688 commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445692494
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+ .findFirst());
+
+ Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s
-> !s.isCompleted()).firstInstant();
+ if (earliestPendingCommit.isPresent()
+ && (!earliestCommitToRetain.isPresent()
+ || compareTimestamps(earliestPendingCommit.get().requestedTime(),
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+ earliestCommitToRetain =
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();
Review Comment:
> If the earliest pending commit precedes all completed commits,
`findInstantsBefore(...).lastInstant()` returns empty and the by-hours ECTR
collapses to `Option.empty()`. Please confirm an empty ECTR means "retain
everything" (not "clean everything") downstream, and add a test for the
pending-older-than-all-completed case — the new test only covers a completed
commit existing before the pending one.
Yes, `findInstantsBefore(...).lastInstant()` can return empty in that case.
Downstream, an empty ECTR means no partition scan/no cleaning for commit/hour
based cleaning: `CleanPlanner#getPartitionPathsForCleanByCommits`
returns an empty partition list when the retained instant is absent.
I added a test for the pending-before-all-completed case. I also tightened
the guard so the pending override only applies when the initial by-hours ECTR
is present. This avoids turning a "no completed instant after
cutoff" case into a synthetic ECTR based on the pending instant.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]