hudi-bot opened a new issue, #15400:
URL: https://github.com/apache/hudi/issues/15400

   We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS. 
Of these, LATEST_COMMITS can be 
[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177] 
because we track the earliest retained commit and read only the commits 
(commit metadata) after it to find the partitions that might 
be eligible for cleaning. 
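
   The incremental lookup described above can be sketched roughly as follows. This is an illustrative Python model, not Hudi's actual `CleanPlanner` code: the function name `partitions_touched_after`, the dictionary layout, and the assumption that instant times compare correctly as strings are all hypothetical.

```python
from typing import Dict, Set


def partitions_touched_after(
    commit_partitions: Dict[str, Set[str]],
    earliest_retained: str,
) -> Set[str]:
    """Collect partitions written by commits newer than the earliest retained one.

    commit_partitions maps a commit instant (assumed to sort lexicographically
    in time order) to the partitions recorded in that commit's metadata.
    """
    touched: Set[str] = set()
    for instant, partitions in commit_partitions.items():
        # Only commits strictly after the earliest retained commit are read.
        if instant > earliest_retained:
            touched.update(partitions)
    return touched


# Hypothetical timeline: three commits, each writing to one or two partitions.
example_commits = {
    "001": {"2022/09/01"},
    "002": {"2022/09/02"},
    "003": {"2022/09/02", "2022/09/03"},
}
candidates = partitions_touched_after(example_commits, "001")
```

   Only the partitions named in the newer commits become cleaning candidates, so the cost scales with the number of recent commits rather than with the total number of partitions in the table.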
   
   With LATEST_FILE_VERSIONS, no such optimization is possible, so we always 
end up doing a [full listing|#L207].
   
    
   
   As you can imagine, for larger tables with a huge number of partitions, this 
can hurt performance. So I am wondering whether we can introduce a hybrid 
cleaner policy that combines both. 
   
   For example, we would clean based on LATEST_FILE_VERSIONS until the Nth 
commit (say, 10), and on every 10th commit we would trigger the cleaner based on 
LATEST_COMMITS. That way, from the 11th commit through the 20th, we can poll the 
commits after the 10th to get the list of partitions to clean instead of doing 
a full listing. 
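
   One possible reading of this schedule is sketched below in Python. It is only an interpretation of the proposal, not an existing Hudi feature: `plan_clean`, its return shape, and the rule that the checkpoint is the last multiple of N strictly before the current commit are assumptions made for illustration.

```python
from typing import Tuple


def plan_clean(commit_number: int, n: int = 10) -> Tuple[str, str]:
    """Return (policy, partition_source) for the given 1-based commit number.

    Every n-th commit cleans with LATEST_COMMITS and acts as a checkpoint;
    all other commits clean with LATEST_FILE_VERSIONS. Before the first
    checkpoint exists, partitions come from a full listing; afterwards they
    come from the commits written after the most recent earlier checkpoint.
    """
    if commit_number < 1 or n < 1:
        raise ValueError("commit_number and n must be positive")

    policy = "LATEST_COMMITS" if commit_number % n == 0 else "LATEST_FILE_VERSIONS"

    # Last multiple of n strictly before this commit (0 means no checkpoint yet).
    checkpoint = (commit_number - 1) // n * n
    if checkpoint == 0:
        source = "full listing"
    else:
        source = f"commits after {checkpoint}"
    return policy, source


# Walk the first 21 commits with N = 10, as in the example above.
schedule = {c: plan_clean(c) for c in range(1, 22)}
```

   With N = 10, commits 1 through 10 still pay for a full listing, while commits 11 through 20 only read the commits written after the 10th.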
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-4750
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   03/Sep/22 14:02;zouxxyy;Maybe we can directly increase the default value of 
`hoodie.clean.max.commits`? Currently it's 1, which I feel is a bit too frequent.

