danny0405 commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1673232619


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy 
strategy) {
     }
   }
 
+  private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant> 
earliestInstant, CleanPlanner<T, I, K, O> planner) throws IOException {
+    LOG.info("Nothing to clean here. It is already clean");
+    Option<HoodieInstant> instantVal = 
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+    HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+    instantVal.map(x -> 
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+        .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+        .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(), 
x.getAction(), x.getState().name()))
+        
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+    
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+    return cleanBuilder.build();
+  }
+
+  /**
+   * There is nothing to clean so this method will create an empty clean plan. 
But there is chance of optimizing
+   * the subsequent cleaner calls. Consider this scenarios in incremental 
cleaner mode,
+   * If clean timeline is empty or no clean commits were created for a while 
then every clean call will have to
+   * scan all the partitions, by creating an empty clean commit to update 
earliestCommitToRetain instant value,
+   * incremental clean policy does not have to look for file changes in all 
the partitions, rather it will look
+   * for partitions that are modified in last x hours. This value is 
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.

Review Comment:
   > Also, we cannot disable cleans for such datasets, since those datasets can 
be onboarded to clustering and can do adhoc rewrites on some partitions. So, 
clean have to be enabled all the time and when we are enabling cleans it is 
scanning all the partitions. So, we thought of storing the 
earliestCommitToRetain value somewhere and use it to store the commits it has 
scanned. 
   
   Can we do force clean in the rewrite jobs or store the last 
`earliestCommitToRetain` separately in the metadata, instead of empty clean 
commits, like as you mentioned, most of the clean commits would just be useless 
and empty for this special inserts scenario. I kind of think we should come up 
with a more general solution for append only use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to