sivabalan narayanan created HUDI-4750:
-----------------------------------------

             Summary: Introduce Hybrid Cleaner policy based on both 
LATEST_COMMITS and LATEST_FILE_VERSIONS
                 Key: HUDI-4750
                 URL: https://issues.apache.org/jira/browse/HUDI-4750
             Project: Apache Hudi
          Issue Type: Improvement
          Components: cleaning
            Reporter: sivabalan narayanan


We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS. 
Of these, LATEST_COMMITS can be more 
[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177] 
because we track the earliest retained commit and only read the commit metadata 
of the commits after it to find the partitions that might be eligible for 
cleaning. 

With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always end 
up doing a [full 
listing|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L207].
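
To illustrate the difference, here is a minimal sketch of the incremental lookup. It is not the CleanPlanner code: the map of instant time to partitions is a hypothetical stand-in for commit metadata, the method name is made up for this example, and treating the earliest retained instant as inclusive is an assumption.

```java
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IncrementalPartitionSketch {

    // Hypothetical stand-in for commit metadata:
    // instant time -> partitions written by that commit.
    static Set<String> partitionsTouchedSince(NavigableMap<String, Set<String>> commits,
                                              String earliestRetained) {
        Set<String> result = new TreeSet<>();
        // Only commits at or after the earliest retained instant are read
        // (inclusive bound is an assumption of this sketch).
        for (Set<String> partitions : commits.tailMap(earliestRetained, true).values()) {
            result.addAll(partitions);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, Set<String>> commits = new TreeMap<>();
        commits.put("001", new TreeSet<>(Arrays.asList("2022/08/01", "2022/08/02")));
        commits.put("002", new TreeSet<>(Arrays.asList("2022/08/02")));
        commits.put("003", new TreeSet<>(Arrays.asList("2022/08/03")));
        // Prints [2022/08/02, 2022/08/03]; partition 2022/08/01 is never visited.
        System.out.println(partitionsTouchedSince(commits, "002"));
    }
}
```

A full listing, by contrast, would have to visit every partition of the table regardless of which commits touched it.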

As you can imagine, for larger tables with a huge number of partitions, this 
can hurt performance. So I am wondering if we can introduce a hybrid cleaner 
policy that combines both. 

For example, we would clean based on LATEST_FILE_VERSIONS up to the Nth commit 
(say, 10), and on every 10th commit we would trigger the cleaner based on 
LATEST_COMMITS. That way, from the 11th commit through the 20th, we can poll the 
commits after the 10th to get the list of partitions to clean instead of doing 
a full listing. 
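
One possible reading of this schedule, as a rough sketch only: the class, enum, and method names below are hypothetical and not part of Hudi, commits are assumed to be numbered from 1, and the interval of 10 is just the example value from above.

```java
class HybridCleanerSketch {

    // Hypothetical enum mirroring the two policy names used in this issue.
    enum Policy { LATEST_COMMITS, LATEST_FILE_VERSIONS }

    // Every interval-th commit runs the commit-based cleaner;
    // all other commits run the file-version-based cleaner.
    static Policy policyFor(int commitNumber, int interval) {
        return commitNumber % interval == 0 ? Policy.LATEST_COMMITS : Policy.LATEST_FILE_VERSIONS;
    }

    // Most recent commit-based run strictly before this commit (0 if none yet).
    // Commits after this checkpoint are the ones polled for partitions to clean.
    static int lastCheckpointBefore(int commitNumber, int interval) {
        return ((commitNumber - 1) / interval) * interval;
    }

    // Until the first commit-based run exists, a full listing is still needed.
    static boolean needsFullListing(int commitNumber, int interval) {
        return lastCheckpointBefore(commitNumber, interval) == 0;
    }

    public static void main(String[] args) {
        int interval = 10;
        for (int commit : new int[] {9, 10, 11, 20, 21}) {
            System.out.println(commit + " -> " + policyFor(commit, interval)
                + ", poll commits after " + lastCheckpointBefore(commit, interval)
                + ", full listing: " + needsFullListing(commit, interval));
        }
    }
}
```

With an interval of 10, commits 11 through 20 all resolve to checkpoint 10, which matches the example: only commits after the 10th need to be read to find candidate partitions.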


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
