[I] Introduce Hybrid Cleaner policy based on both LATEST_COMMITS and LATEST_FILE_VERSIONS [hudi]

via GitHub Sat, 29 Nov 2025 21:48:14 -0800


hudi-bot opened a new issue, #15400:
URL: https://github.com/apache/hudi/issues/15400

We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS.
Of these, LATEST_COMMITS can be more
[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177],
because we maintain the earliest retained commit and only read the commits
(commit metadata) after it to find the partitions that might
be eligible for cleaning.

With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always
end up doing a [full listing|#L207].
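
To make the difference concrete, here is a minimal sketch of the incremental path, not Hudi's actual implementation. The class and method names are hypothetical, the timeline is modeled as a plain sorted map from instant time to the partitions that commit wrote, and the boundary is treated as exclusive ("after" the earliest retained commit), which is an assumption.

```java
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the Hudi code base.
final class PartitionSelectionSketch {

    /**
     * Collects the partitions written by commits strictly after the
     * earliest retained commit. Only these partitions can hold file
     * versions that became eligible for cleaning since the last clean,
     * so the rest of the table does not need to be listed.
     *
     * @param timeline         instant time -> partitions written (assumed model)
     * @param earliestRetained instant time of the earliest retained commit
     */
    static Set<String> partitionsTouchedAfter(
            NavigableMap<String, Set<String>> timeline, String earliestRetained) {
        Set<String> touched = new TreeSet<>();
        // tailMap(key, false) is the view of entries with keys greater than key.
        for (Set<String> partitions : timeline.tailMap(earliestRetained, false).values()) {
            touched.addAll(partitions);
        }
        return touched;
    }
}
```

With LATEST_FILE_VERSIONS there is no such starting point, so the equivalent step would have to return every partition in the table.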

As you can imagine, for larger tables with a huge number of partitions, this
can hurt performance. So I am wondering if we can introduce a hybrid cleaner
policy that combines both.

For example, we would clean based on LATEST_FILE_VERSIONS up to the Nth
commit (say, N = 10), and on every 10th commit we would trigger the cleaner
based on LATEST_COMMITS. That way, from the 11th commit through the 20th, we
can poll the commits after the 10th to get the list of partitions to clean
instead of doing a full listing.
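
The schedule described above could be sketched roughly as follows. This is only an illustration of the proposal under stated assumptions: commits are numbered from 1, the interval N is a plain integer, and the enum and method names are hypothetical rather than existing Hudi APIs.

```java
// Hypothetical sketch of the proposed hybrid schedule; not part of Hudi.
final class HybridCleanerSketch {

    enum Policy { LATEST_COMMITS, LATEST_FILE_VERSIONS }

    /** Every interval-th commit cleans by LATEST_COMMITS, all others by LATEST_FILE_VERSIONS. */
    static Policy policyFor(int commitNumber, int interval) {
        if (commitNumber < 1 || interval < 1) {
            throw new IllegalArgumentException("commitNumber and interval must be >= 1");
        }
        return commitNumber % interval == 0
                ? Policy.LATEST_COMMITS
                : Policy.LATEST_FILE_VERSIONS;
    }

    /**
     * Most recent LATEST_COMMITS checkpoint strictly before this commit,
     * or 0 if there is none yet. Commits after the checkpoint are the
     * ones to poll for partitions to clean.
     */
    static int lastCheckpointBefore(int commitNumber, int interval) {
        return ((commitNumber - 1) / interval) * interval;
    }

    /** Without an earlier checkpoint there is nothing to poll from, so a full listing is needed. */
    static boolean needsFullListing(int commitNumber, int interval) {
        return lastCheckpointBefore(commitNumber, interval) == 0;
    }
}
```

With an interval of 10, commits 11 through 20 all resolve to checkpoint 10 and can poll the commits after it, while commits 1 through 10 still fall back to a full listing because no checkpoint exists yet.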

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-4750
- Type: Improvement

---

## Comments

03/Sep/22 14:02; zouxxyy: Maybe we can directly increase the default value of
`hoodie.clean.max.commits`? Currently it's 1, which I feel is a bit too fast.

