[
https://issues.apache.org/jira/browse/HUDI-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599934#comment-17599934
]
zouxxyy commented on HUDI-4750:
-------------------------------
Maybe we can simply increase the default value of `hoodie.clean.max.commits`?
It is currently 1, which I feel triggers cleaning a bit too often.
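
For reference, a minimal sketch of overriding this setting for a single writer instead of changing the default. The value 10 is an arbitrary illustration, not a recommendation, and the name `hudi_options` is hypothetical. Only the key `hoodie.clean.max.commits` comes from the discussion above.

```python
# Writer options as a plain dict. The value "10" is an assumption for illustration only.
hudi_options = {
    # Number of commits to wait after the last clean before scheduling the next one.
    "hoodie.clean.max.commits": "10",
}
```

A dict like this would typically be passed together with the job's other write options, for example through `.options(**hudi_options)` on a Spark DataFrame writer.
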
> Introduce Hybrid Cleaner policy based on both LATEST_COMMITS and
> LATEST_FILE_VERSIONS
> -------------------------------------------------------------------------------------
>
> Key: HUDI-4750
> URL: https://issues.apache.org/jira/browse/HUDI-4750
> Project: Apache Hudi
> Issue Type: Improvement
> Components: cleaning
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 0.13.0
>
>
> We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS.
> Of these, LATEST_COMMITS can be
> [[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177]],
> because we maintain the earliest retained commit and only read the newer
> commits (commit metadata) after it to find the partitions that
> might be eligible for cleaning.
> With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always
> end up doing a [full listing|#L207].
>
> As you can imagine, for larger tables with a huge number of partitions, this
> can hurt performance. So I am wondering if we can introduce a hybrid cleaner
> policy that combines both.
> For example, we would clean based on LATEST_FILE_VERSIONS until the Nth
> commit (say, the 10th), and on every 10th commit we would trigger the cleaner
> based on LATEST_COMMITS, so that from the 11th commit until the 20th we can
> poll the commits after the 10th to get the list of partitions to clean instead
> of doing a full listing.
>
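
To make the difference described in the issue concrete, here is a rough sketch of the two ways of finding candidate partitions. It is not Hudi's actual CleanPlanner code. The `Commit` class, both function names, and the sample partition paths are hypothetical, and the comparison `instant > earliest_retained` is an assumption based on the wording "after the earliest retained".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for commit metadata: an instant and the partitions it wrote to."""
    instant: int
    partitions: frozenset


def partitions_via_full_listing(all_partitions):
    # No checkpoint is available, so every partition in the table is a candidate.
    return set(all_partitions)


def partitions_via_commit_metadata(timeline, earliest_retained):
    # Only partitions written by commits after the earliest retained commit can
    # contain new file versions, so only those need to be inspected.
    touched = set()
    for commit in timeline:
        if commit.instant > earliest_retained:
            touched |= commit.partitions
    return touched


timeline = [
    Commit(1, frozenset({"2022/01/01"})),
    Commit(2, frozenset({"2022/01/02"})),
    Commit(3, frozenset({"2022/01/02", "2022/01/03"})),
]
all_partitions = {"2022/01/01", "2022/01/02", "2022/01/03", "2022/01/04"}

full = partitions_via_full_listing(all_partitions)
incremental = partitions_via_commit_metadata(timeline, earliest_retained=1)
```

In this toy timeline the incremental path inspects two partitions while the full listing inspects all four. On a table with many partitions and few recently written ones, that gap is the cost the hybrid policy is trying to avoid.
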
--
This message was sent by Atlassian Jira
(v8.20.10#820010)