suryaprasanna commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673110615


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
     }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant> 
earliestInstant, CleanPlanner<T, I, K, O> planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option<HoodieInstant> instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+        
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   @danny0405 
   The usecase we are trying to solve here is when a dataset that does not 
receive upserts and does not use small handling, cleans are running for every 
commit and scans all the partitions all the time. Also, we cannot disable 
cleans for such datasets, since those datasets can be onboarded to clustering 
and can do adhoc rewrites on some partitions. So, clean have to be enabled all 
the time and when we are enabling cleans it is scanning all the partitions. So, 
we thought of storing the earliestCommitToRetain value somewhere and use it to 
store the commits it has scanned. That way every clean will only scan the 
partitions that were touched in a commit since earliestCommitToRetain progress 
commit by commit when incremental clean policy is enabled. This reduces the 
full scan for all partitions to scanning one or two partitions.
   CC @nsivabalan 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to