hudi-bot opened a new issue, #15400: URL: https://github.com/apache/hudi/issues/15400
We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS. Of these, LATEST_COMMITS might be [more efficient](https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177), because we maintain the earliest retained commit and only read the commit metadata of commits after it to find the partitions that might be eligible for cleaning. With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always end up doing a full listing (#L207). As you can imagine, for larger tables with a huge number of partitions, this can hurt performance.

So, I am wondering if we can introduce a hybrid cleaner policy that combines both. For example, we clean based on LATEST_FILE_VERSIONS until the Nth commit (say, the 10th), and on every 10th commit we trigger the cleaner based on LATEST_COMMITS. That way, from the 11th commit through the 20th, we can poll the commits after the 10th to get the list of partitions to clean instead of doing a full listing.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-4750
- Type: Improvement

---

## Comments

zouxxyy (03/Sep/22 14:02): Maybe we can directly increase the default value of `hoodie.clean.max.commits`? Currently it's 1, which I feel is a bit too frequent.
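To make the proposal in the description more concrete, here is a minimal Java sketch of the hybrid schedule. It is only an illustration under assumptions: `HybridCleanerSketch`, `Commit`, `policyFor`, `lastCheckpoint`, and `partitionsToInspect` are hypothetical names and do not exist in Hudi, commits are modeled as plain sequence numbers, and the interval of 10 is the example value from the description.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: none of these names are part of Hudi's API.
class HybridCleanerSketch {

    enum Policy { LATEST_FILE_VERSIONS, LATEST_COMMITS }

    /** A commit reduced to its sequence number and the partitions it wrote to. */
    static final class Commit {
        final int seq;
        final Set<String> partitions;

        Commit(int seq, Set<String> partitions) {
            this.seq = seq;
            this.partitions = partitions;
        }
    }

    /** Every interval-th commit cleans with LATEST_COMMITS; all others use LATEST_FILE_VERSIONS. */
    static Policy policyFor(int commitSeq, int interval) {
        return commitSeq % interval == 0 ? Policy.LATEST_COMMITS : Policy.LATEST_FILE_VERSIONS;
    }

    /** Sequence number of the most recent LATEST_COMMITS run, or 0 if there has been none yet. */
    static int lastCheckpoint(int commitSeq, int interval) {
        return (commitSeq / interval) * interval;
    }

    /** Partitions touched by commits after the checkpoint: the candidates to clean without a full listing. */
    static Set<String> partitionsToInspect(List<Commit> timeline, int checkpointSeq) {
        Set<String> touched = new TreeSet<>();
        for (Commit c : timeline) {
            if (c.seq > checkpointSeq) {
                touched.addAll(c.partitions);
            }
        }
        return touched;
    }
}
```

With an interval of 10, commit 10 would run LATEST_COMMITS and act as the checkpoint, and commits 11 through 19 would only look at partitions written after commit 10. Before the first checkpoint (`lastCheckpoint` returns 0), a full listing would presumably still be needed.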
