[jira] [Commented] (HUDI-4750) Introduce Hybrid Cleaner policy based on both LATEST_COMMITS and LATEST_FILE_VERSIONS

zouxxyy (Jira) Sat, 03 Sep 2022 07:03:05 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599934#comment-17599934
 ]


zouxxyy commented on HUDI-4750:
-------------------------------

Maybe we could simply increase the default value of `hoodie.clean.max.commits`? 
It is currently 1, which I feel triggers cleaning a bit too often.
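
For illustration only, a larger value can also be set per writer without touching the default. The sketch below builds PySpark-style Hudi write options; the helper name `cleaner_options` and the value 5 are hypothetical and were not proposed in this thread.

```python
def cleaner_options(max_commits: int = 5) -> dict:
    """Return Hudi write options that attempt a clean every `max_commits` commits.

    The default of 5 is an arbitrary illustrative value (assumption).
    """
    if max_commits < 1:
        raise ValueError("max_commits must be at least 1")
    return {
        # Number of commits after the last clean before a new clean is attempted.
        "hoodie.clean.max.commits": str(max_commits),
    }
```

The returned dict would typically be passed to a write, for example `df.write.format("hudi").options(**cleaner_options())`.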

> Introduce Hybrid Cleaner policy based on both LATEST_COMMITS and 
> LATEST_FILE_VERSIONS
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-4750
>                 URL: https://issues.apache.org/jira/browse/HUDI-4750
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.13.0
>
>
> We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS. 
> Of the two, LATEST_COMMITS can be more 
> [efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177], 
> because we track the earliest retained commit and only read the commit 
> metadata of commits after it to find the partitions that might be 
> eligible for cleaning. 
> With LATEST_FILE_VERSIONS we can't do any such optimization, so we always 
> end up doing a [full listing|#L207].
>  
> As you can imagine, for larger tables with a huge number of partitions, 
> this might hurt performance. So I am wondering whether we can introduce a 
> hybrid cleaner policy that combines both. 
> For example, we would clean based on LATEST_FILE_VERSIONS until the Nth 
> commit (say, the 10th), and on every 10th commit we would trigger the 
> cleaner based on LATEST_COMMITS. That way, from the 11th commit through 
> the 20th, we can poll the commits after the 10th to get the list of 
> partitions to clean instead of doing a full listing. 
>  
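
One possible reading of the hybrid proposal quoted above, as a minimal sketch: every Nth commit acts as a checkpoint, and between checkpoints the cleaner only looks at partitions touched by commits after the last checkpoint. This is not Hudi code. The function names, the integer commit numbering, and the `touched_by_commit` mapping (a stand-in for commit metadata) are all assumptions made for illustration.

```python
from typing import Dict, Set


def last_checkpoint(commit_number: int, interval: int = 10) -> int:
    """Largest multiple of `interval` strictly below `commit_number` (0 if none)."""
    return ((commit_number - 1) // interval) * interval


def partitions_to_scan(
    commit_number: int,
    all_partitions: Set[str],
    touched_by_commit: Dict[int, Set[str]],
    interval: int = 10,
) -> Set[str]:
    """Pick the partitions the cleaner would inspect at `commit_number`.

    Before the first checkpoint there is nothing to poll from, so fall back
    to a full listing. Afterwards, union the partitions touched by every
    commit after the last checkpoint, up to and including this one.
    """
    checkpoint = last_checkpoint(commit_number, interval)
    if checkpoint == 0:
        return set(all_partitions)  # full listing
    scanned: Set[str] = set()
    for commit in range(checkpoint + 1, commit_number + 1):
        scanned |= touched_by_commit.get(commit, set())
    return scanned
```

With an interval of 10, commit 13 would poll commits 11 through 13, while commits 1 through 10 would still require a full listing.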



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
