yihua commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445501705
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+ .findFirst());
+
+ Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s
-> !s.isCompleted()).firstInstant();
+ if (earliestPendingCommit.isPresent()
+ && (!earliestCommitToRetain.isPresent()
+ || compareTimestamps(earliestPendingCommit.get().requestedTime(),
LESSER_THAN, earliestCommitToRetain.get().requestedTime()))) {
+ earliestCommitToRetain =
completedCommitsTimeline.findInstantsBefore(earliestPendingCommit.get().requestedTime()).lastInstant();
Review Comment:
If the earliest pending commit precedes all completed commits,
`findInstantsBefore(...).lastInstant()` returns empty and the by-hours ECTR
collapses to `Option.empty()`. Please confirm an empty ECTR means "retain
everything" (not "clean everything") downstream, and add a test for the
pending-older-than-all-completed case — the new test only covers a completed
commit existing before the pending one.
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant>
getEarliestCommitToRetain(
} else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant,
timeZone.getZoneId());
String earliestTimeToRetain =
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
- earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i
-> compareTimestamps(i.requestedTime(),
- GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+ earliestCommitToRetain =
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+ .filter(i -> compareTimestamps(i.requestedTime(),
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
Review Comment:
This still selects the first instant `>=` the cutoff, but the PR description
says the fix is to pick the instant *at or before* the cutoff. That change
isn't here, so the sparse-commit-density bug described in the issue isn't
actually addressed — only the new pending-instant guard below is. Please
reconcile the description with the code (or add the intended change).
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v2/TimelineArchiverV2.java:
##########
@@ -236,7 +237,11 @@ private List<HoodieInstant> getCommitInstantsToArchive()
throws IOException {
config.getCleanerPolicy());
earliestInstantToRetainCandidates.add(earliestInstantToRetainForClustering);
- // 4. If metadata table is enabled, do not archive instants which are more
recent than the last compaction on the
+ // 4. If enabled, block archival based on ECTR from the last completed
clean to ensure we don't archive
+ // commits that have data files that haven't been cleaned yet.
+
earliestInstantToRetainCandidates.add(getEarliestInstantToRetainForClean(table,
completedCommitsTimeline, config));
Review Comment:
This makes TimelineArchiverV2 (LSM/v9) block on clean ECTR, inverting the
previously-asserted behavior (the old test documented V1-only). Please confirm
V1-only was an oversight rather than a deliberate LSM restriction. Note this
also adds a new failure mode: V2 archival now aborts with `HoodieIOException`
if the clean-metadata read fails (when the config is enabled).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]