hudi-agent commented on code in PR #19041:
URL: https://github.com/apache/hudi/pull/19041#discussion_r3445495234


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivalUtils.java:
##########
@@ -110,6 +113,37 @@ public static Pair<Integer,Integer> 
getMinAndMaxInstantsToKeep(HoodieTable<?, ?,
     return Pair.of(minInstantsToKeep, maxInstantsToKeep);
   }
 
+  public static Option<HoodieInstant> getEarliestInstantToRetainForClean(
+      HoodieTable<?, ?, ?, ?> table,
+      HoodieTimeline commitTimeline,
+      HoodieWriteConfig config) {
+    if (!config.shouldBlockArchivalOnCleanECTR()) {
+      return Option.empty();
+    }
+
+    Option<HoodieInstant> lastCleanInstant = 
table.getCleanTimeline().filterCompletedInstants().lastInstant();
+    if (!lastCleanInstant.isPresent()) {
+      return Option.empty();
+    }
+
+    try {
+      HoodieCleanMetadata cleanMetadata = 
table.getActiveTimeline().readCleanMetadata(lastCleanInstant.get());
+      String earliestCommitToRetain = 
cleanMetadata.getEarliestCommitToRetain();
+      if (earliestCommitToRetain == null || 
earliestCommitToRetain.trim().isEmpty()) {
+        return Option.empty();
+      }
+
+      Option<HoodieInstant> earliestInstantToRetain =
+          
commitTimeline.findInstantsAfterOrEquals(earliestCommitToRetain).firstInstant();
+      log.info("Blocking archival based on earliest commit to retain {} from 
last clean {}. Earliest instant to retain is {}",
+          earliestCommitToRetain, lastCleanInstant.get().requestedTime(), 
earliestInstantToRetain.map(instant -> instant).orElse(null));

Review Comment:
   🤖 nit: `.map(instant -> instant)` is a no-op identity transform — could you 
drop it and use `earliestInstantToRetain.orElse(null)` directly?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -1059,6 +1061,20 @@ public class FlinkOptions extends HoodieConfig {
       .withFallbackKeys("hoodie.clean.fileversions.retained")
       .withDescription("Number of file versions to retain. default 5");
 
+  public static final ConfigOption<Long> CLEAN_MAX_COMMITS_TO_CLEAN = 
ConfigOptions

Review Comment:
   🤖 nit: the `_TO_CLEAN` suffix is redundant here — every other option in this 
class uses the pattern `CLEAN_<descriptor>` (e.g. `CLEAN_RETAIN_COMMITS`, 
`CLEAN_RETAIN_HOURS`) without repeating the namespace. Could you rename this to 
`CLEAN_MAX_COMMITS` to stay consistent?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java:
##########
@@ -151,8 +152,16 @@ public static Option<HoodieInstant> 
getEarliestCommitToRetain(
     } else if (cleaningPolicy == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
       ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(latestInstant, 
timeZone.getZoneId());
       String earliestTimeToRetain = 
TimelineUtils.formatDate(Date.from(latestDateTime.minusHours(hoursRetained).toInstant()));
-      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream().filter(i 
-> compareTimestamps(i.requestedTime(),
-          GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
+      earliestCommitToRetain = 
Option.fromJavaOptional(completedCommitsTimeline.getInstantsAsStream()
+          .filter(i -> compareTimestamps(i.requestedTime(), 
GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
+          .findFirst());
+
+      Option<HoodieInstant> earliestPendingCommit = commitsTimeline.filter(s 
-> !s.isCompleted()).firstInstant();

Review Comment:
   🤖 Could you confirm the intent when `earliestCommitToRetain` is initially 
empty (no completed commit at-or-after the cutoff) but a pending instant 
exists? With this change, ECTR gets set to the last completed before the 
pending — typically the latest completed — so a quiet table with one in-flight 
commit would flip from "no cleaning" to "clean everything older than the latest 
completed". The parallel `KEEP_LATEST_COMMITS` branch avoids this because its 
pending check is nested under `countInstants > commitsRetained`. Was this 
asymmetry intended, or should the override only apply when the initial ECTR was 
already present?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to