KnightChess commented on code in PR #11605:
URL: https://github.com/apache/hudi/pull/11605#discussion_r1693728107
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy
strategy) {
}
}
+ private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant>
earliestInstant, CleanPlanner<T, I, K, O> planner) throws IOException {
+ LOG.info("Nothing to clean here. It is already clean");
+ Option<HoodieInstant> instantVal =
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+ HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+ instantVal.map(x ->
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+ .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+ .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(),
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+ return cleanBuilder.build();
+ }
+
+ /**
+ * There is nothing to clean so this method will create an empty clean plan.
But there is chance of optimizing
+ * the subsequent cleaner calls. Consider this scenarios in incremental
cleaner mode,
+ * If clean timeline is empty or no clean commits were created for a while
then every clean call will have to
+ * scan all the partitions, by creating an empty clean commit to update
earliestCommitToRetain instant value,
+ * incremental clean policy does not have to look for file changes in all
the partitions, rather it will look
+ * for partitions that are modified in last x hours. This value is
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.
Review Comment:
hive meta sync also use `last_commit_completion_time_sync` to reduce scan
all partition path, it store in hms.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -97,6 +99,64 @@ private boolean needsCleaning(CleaningTriggerStrategy
strategy) {
}
}
+ private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant>
earliestInstant, CleanPlanner<T, I, K, O> planner) throws IOException {
+ LOG.info("Nothing to clean here. It is already clean");
+ Option<HoodieInstant> instantVal =
getEarliestCommitToRetainToCreateEmptyCleanPlan(earliestInstant);
+ HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder();
+ instantVal.map(x ->
cleanBuilder.setPolicy(config.getCleanerPolicy().name())
+ .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION)
+ .setEarliestInstantToRetain(new HoodieActionInstant(x.getTimestamp(),
x.getAction(), x.getState().name()))
+
.setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp())
+
).orElse(cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()));
+ return cleanBuilder.build();
+ }
+
+ /**
+ * There is nothing to clean so this method will create an empty clean plan.
But there is chance of optimizing
+ * the subsequent cleaner calls. Consider this scenarios in incremental
cleaner mode,
+ * If clean timeline is empty or no clean commits were created for a while
then every clean call will have to
+ * scan all the partitions, by creating an empty clean commit to update
earliestCommitToRetain instant value,
+ * incremental clean policy does not have to look for file changes in all
the partitions, rather it will look
+ * for partitions that are modified in last x hours. This value is
configured through MAX_DURATION_TO_CREATE_EMPTY_CLEAN.
Review Comment:
hive meta sync also use `last_commit_completion_time_sync` to reduce scan
all partition path, it store in hms.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]