[ https://issues.apache.org/jira/browse/HUDI-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4750:
--------------------------------------
    Description: 
We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS. 
Of these, LATEST_COMMITS can be 
[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177], 
because we maintain the earliest retained commit and only read the new commits 
(commit metadata) after it to find the partitions that might be eligible for 
cleaning. 

With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always end 
up doing a [full listing|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L207].
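
To make the difference concrete, here is a minimal Python sketch of the two ways of finding candidate partitions. It is not Hudi's CleanPlanner code: the `Commit` class, both function names, and the sample timeline are hypothetical, and it assumes instant times are sortable strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical stand-in for commit metadata on the timeline.
    instant_time: str      # sortable instant, e.g. "003"
    partitions: frozenset  # partitions written by this commit


def partitions_from_commit_metadata(timeline, earliest_retained):
    """Incremental path: look only at commits newer than the earliest retained one."""
    touched = set()
    for commit in timeline:
        if commit.instant_time > earliest_retained:
            touched |= commit.partitions
    return touched


def partitions_from_full_listing(all_partitions):
    """Fallback path: every partition in the table is a candidate."""
    return set(all_partitions)


timeline = [
    Commit("001", frozenset({"2022/08/01"})),
    Commit("002", frozenset({"2022/08/01", "2022/08/02"})),
    Commit("003", frozenset({"2022/08/03"})),
    Commit("004", frozenset({"2022/08/03", "2022/08/04"})),
]
all_partitions = [
    "2022/07/30", "2022/07/31",
    "2022/08/01", "2022/08/02", "2022/08/03", "2022/08/04",
]

incremental = partitions_from_commit_metadata(timeline, earliest_retained="002")
full = partitions_from_full_listing(all_partitions)
```

In this toy table the incremental path inspects two partitions while the full listing inspects all six; the gap grows with the number of partitions that recent commits did not touch.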

 

As you can imagine, for larger tables with a huge number of partitions, this 
can hurt performance. So, I am wondering if we can introduce a hybrid cleaner 
policy that combines both. 

For example, we would clean based on LATEST_FILE_VERSIONS until the Nth commit 
(say 10), and on every 10th commit we would trigger the cleaner based on 
LATEST_COMMITS, so that from the 11th commit until the 20th we can poll the 
commits after the 10th to get the list of partitions to clean instead of doing 
a full listing. 
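
One possible reading of this schedule, sketched in Python. The function `plan_clean`, its return shape, and the choice to fall back to a full listing before the first checkpoint are assumptions made for illustration; nothing here is an existing Hudi API or config.

```python
def plan_clean(commit_number, n=10):
    """Return (policy, poll_after) for the clean that follows a given commit.

    poll_after is the commit number after which commit metadata would be
    read to find partitions; None means this sketch does not derive one.
    """
    if commit_number % n == 0:
        # Every Nth commit: run LATEST_COMMITS, which tracks its own
        # earliest retained commit and acts as the checkpoint.
        return ("LATEST_COMMITS", None)
    last_checkpoint = (commit_number // n) * n
    if last_checkpoint == 0:
        # Assumption: no checkpoint exists yet, so a full listing is needed.
        return ("LATEST_FILE_VERSIONS", None)
    # Between checkpoints: poll commits after the last checkpoint
    # instead of listing every partition.
    return ("LATEST_FILE_VERSIONS", last_checkpoint)


schedule = {c: plan_clean(c) for c in (5, 10, 11, 19, 20, 27)}
```

With N = 10, commits 11 through 19 would all poll commits after the 10th, and the 20th commit would move the checkpoint forward.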

 


> Introduce Hybrid Cleaner policy based on both LATEST_COMMITS and 
> LATEST_FILE_VERSIONS
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-4750
>                 URL: https://issues.apache.org/jira/browse/HUDI-4750
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
