pratyakshsharma commented on code in PR #7041:
URL: https://github.com/apache/hudi/pull/7041#discussion_r1061583585
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##########
@@ -488,13 +497,14 @@ public Option<HoodieInstant> getEarliestCommitToRetain() {
int hoursRetained = config.getCleanerHoursRetained();
if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_COMMITS
&& commitTimeline.countInstants() > commitsRetained) {
- earliestCommitToRetain =
commitTimeline.nthInstant(commitTimeline.countInstants() - commitsRetained);
//15 instants total, 10 commits to retain, this gives 6th instant in the list
+ earliestCommitToRetain =
+ commitTimeline.nthInstant(commitTimeline.countInstants() -
commitsRetained); //15 instants total, 10 commits to retain, this gives 6th
instant in the list- commitsRetained, 0));
Review Comment:
Can you remove the changes in this comment here? Are they needed?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -146,10 +179,12 @@ HoodieCleanerPlan requestClean(HoodieEngineContext
context) {
*/
protected Option<HoodieCleanerPlan> requestClean(String startCleanTime) {
final HoodieCleanerPlan cleanerPlan = requestClean(context);
- if ((cleanerPlan.getFilePathsToBeDeletedPerPartition() != null)
+ // Create a clean request contains the cleaner plan if:
+ // - ALLOW_EMPTY_CLEAN_COMMITS is true
+ // - or the list of the file paths to be deleted is not empty
+ if (config.allowEmptyCleanCommits() ||
(cleanerPlan.getFilePathsToBeDeletedPerPartition() != null
Review Comment:
we should probably add another check to validate that
ALLOW_EMPTY_CLEAN_COMMITS is set to true only if incremental cleaning is
enabled.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -38,10 +39,15 @@
import org.apache.hudi.exception.HoodieIOException;
import org.apache.hudi.table.HoodieTable;
import org.apache.hudi.table.action.BaseActionExecutor;
+
Review Comment:
nit: remove the extra line.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##########
@@ -187,9 +187,18 @@ private List<String>
getPartitionPathsForIncrementalCleaning(HoodieCleanMetadata
LOG.info("Incremental Cleaning mode is enabled. Looking up partition-paths
that have since changed "
+ "since last cleaned at " + cleanMetadata.getEarliestCommitToRetain()
+ ". New Instant to retain : " + newInstantToRetain);
+ String commitJustBeforeEarliestCommitToRetain = null;
Review Comment:
Just wondering if these changes in this method are actually needed. Even if
we keep the original logic, I guess we are not missing out on any scenario.
Please correct me if I am missing anything here. Ultimately you are only
getting the lastCheckedCommit and that is already taken care of in the original
logic.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -86,7 +92,29 @@ private boolean needsCleaning(CleaningTriggerStrategy
strategy) {
if (strategy == CleaningTriggerStrategy.NUM_COMMITS) {
int numberOfCommits = getCommitsSinceLastCleaning();
int maxInlineCommitsForNextClean = config.getCleaningMaxCommits();
- return numberOfCommits >= maxInlineCommitsForNextClean;
+ if (numberOfCommits >= maxInlineCommitsForNextClean) {
+ // check if the number of commits created after the last clean is
greater than clean.max.commits
+ int commitsRetained = config.getCleanerCommitsRetained();
+ int hoursRetained = config.getCleanerHoursRetained();
+ if (config.getCleanerPolicy() ==
HoodieCleaningPolicy.KEEP_LATEST_COMMITS) {
+ // if cleaner policy is KEEP_LATEST_COMMITS then
+ // check if the number of completed commits in the timeline is
greater than cleaner.commits.retained
+ return table.getCompletedCommitsTimeline().countInstants() >
commitsRetained;
Review Comment:
These checks are already present in CleanPlanner here -
https://github.com/apache/hudi/blob/a0df6ecfb47ac9e393b5eb977850cb8fbe7d3f72/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L313.
The new changes are redundant and we should remove the checks from any one
place. I will leave this to you to decide. I feel we should remove the check
from the place I mentioned since this makes more sense to keep them in this
method here.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -86,7 +92,29 @@ private boolean needsCleaning(CleaningTriggerStrategy
strategy) {
if (strategy == CleaningTriggerStrategy.NUM_COMMITS) {
int numberOfCommits = getCommitsSinceLastCleaning();
int maxInlineCommitsForNextClean = config.getCleaningMaxCommits();
- return numberOfCommits >= maxInlineCommitsForNextClean;
+ if (numberOfCommits >= maxInlineCommitsForNextClean) {
+ // check if the number of commits created after the last clean is
greater than clean.max.commits
+ int commitsRetained = config.getCleanerCommitsRetained();
+ int hoursRetained = config.getCleanerHoursRetained();
+ if (config.getCleanerPolicy() ==
HoodieCleaningPolicy.KEEP_LATEST_COMMITS) {
+ // if cleaner policy is KEEP_LATEST_COMMITS then
+ // check if the number of completed commits in the timeline is
greater than cleaner.commits.retained
+ return table.getCompletedCommitsTimeline().countInstants() >
commitsRetained;
+ } else if (config.getCleanerPolicy() ==
HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
+ // if cleaner policy is KEEP_LATEST_BY_HOURS then
+ // check if there is a commit with timestamp older than current
instant - cleaner.hours.retained
+ Instant instant = Instant.now();
+ ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant,
ZoneId.systemDefault());
+ String earliestTimeToRetain =
HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
+ return
table.getCompletedCommitsTimeline().getInstantsAsStream().filter(i ->
HoodieTimeline.compareTimestamps(i.getTimestamp(),
+ HoodieTimeline.LESSER_THAN, earliestTimeToRetain)).count() > 0;
+ } else if (config.getCleanerPolicy() ==
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS) {
+ // if cleaner policy is KEEP_LATEST_BY_HOURS then
Review Comment:
nit: change the cleaner policy to KEEP_LATEST_FILE_VERSIONS.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##########
@@ -146,10 +179,12 @@ HoodieCleanerPlan requestClean(HoodieEngineContext
context) {
*/
protected Option<HoodieCleanerPlan> requestClean(String startCleanTime) {
final HoodieCleanerPlan cleanerPlan = requestClean(context);
- if ((cleanerPlan.getFilePathsToBeDeletedPerPartition() != null)
+ // Create a clean request contains the cleaner plan if:
Review Comment:
nit: Create a clean request -> New clean request
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]