yihua commented on code in PR #18380:
URL: https://github.com/apache/hudi/pull/18380#discussion_r3035843495


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v1/TimelineArchiverV1.java:
##########
@@ -271,7 +271,30 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() 
throws IOException {
       Option<HoodieInstant> oldestInstantToRetainForClustering =
           
ClusteringUtils.getEarliestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient(), config.getCleanerPolicy());
 
+      // If enabled, block archival based on ECTR from the last completed 
clean to ensure we don't archive
+      // commits that have data files that haven't been cleaned yet.
+      Option<HoodieInstant> oldestInstantToRetainForClean = Option.empty();
+      if (config.shouldBlockArchivalOnCleanECTR()) {
+        Option<HoodieInstant> lastCleanInstant = 
table.getCleanTimeline().filterCompletedInstants().lastInstant();

Review Comment:
   🤖 `table.getCleanTimeline()` delegates to 
`getActiveTimeline().getCleanerTimeline()`, which only returns cleans on the 
active timeline. If the last completed clean has itself been archived already, 
`lastCleanInstant` will be empty and the ECTR check is silently skipped. Could 
you verify this can't happen, or consider also checking the archived timeline 
for the most recent clean?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v1/TimelineArchiverV1.java:
##########
@@ -271,7 +271,30 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() 
throws IOException {
       Option<HoodieInstant> oldestInstantToRetainForClustering =
           
ClusteringUtils.getEarliestInstantToRetainForClustering(table.getActiveTimeline(),
 table.getMetaClient(), config.getCleanerPolicy());
 
+      // If enabled, block archival based on ECTR from the last completed 
clean to ensure we don't archive
+      // commits that have data files that haven't been cleaned yet.
+      Option<HoodieInstant> oldestInstantToRetainForClean = Option.empty();
+      if (config.shouldBlockArchivalOnCleanECTR()) {
+        Option<HoodieInstant> lastCleanInstant = 
table.getCleanTimeline().filterCompletedInstants().lastInstant();
+        if (lastCleanInstant.isPresent()) {
+          try {
+            org.apache.hudi.avro.model.HoodieCleanMetadata cleanMetadata =
+                
table.getActiveTimeline().readCleanMetadata(lastCleanInstant.get());
+            if (cleanMetadata.getEarliestCommitToRetain() != null
+                && 
!cleanMetadata.getEarliestCommitToRetain().trim().isEmpty()) {
+              oldestInstantToRetainForClean = 
commitTimeline.findInstantsAfterOrEquals(
+                  cleanMetadata.getEarliestCommitToRetain()).firstInstant();
+              log.info("Blocking archival based on ECTR {} from last clean {}",
+                  cleanMetadata.getEarliestCommitToRetain(), 
lastCleanInstant.get().requestedTime());
+            }

Review Comment:
   🤖 This catch block silently skips the ECTR check on IOException, which means 
archival proceeds without the safety guard. Since the whole purpose of this 
feature is to prevent data leaks from archiving commits with uncleaned data 
files, shouldn't this fail closed (i.e., throw or block archival) rather than 
fail open? A transient IO error here could cause exactly the data leak the 
feature is designed to prevent.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to