yihua commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445501704
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
Review Comment:
This still selects the first instant `>=` the cutoff, but the PR description
says the fix is to pick the instant at or before the cutoff. That change is not
here, so the sparse-commit-density bug described in the issue is not actually
addressed; only the new pending-instant guard below is. Please reconcile the
description with the code (or add the intended change).
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+ .findFirst());
+
+ Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s
-> !s.isCompleted()).firstInstant();
+ if (earliestPendingCommit.isPresent()
+ && (!earliestCommitToRetain.isPresent()
+ || compareTimestamps(earliestPendingCommit.get().requestedTime(),
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+ earliestCommitToRetain =
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();
Review Comment:
If the earliest pending commit precedes all completed commits,
`findInstantsBefore(...).lastInstant()` returns empty and the by-hours ECTR
collapses to `Option.empty()`. Please confirm an empty ECTR means "retain
everything" (not "clean everything") downstream, and add a test for the
pending-older-than-all-completed case. The new test only covers a completed
commit existing before the pending one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]