sivabalan narayanan created HUDI-4750:
-----------------------------------------
Summary: Introduce Hybrid Cleaner policy based on both
LATEST_COMMITS and LATEST_FILE_VERSIONS
Key: HUDI-4750
URL: https://issues.apache.org/jira/browse/HUDI-4750
Project: Apache Hudi
Issue Type: Improvement
Components: cleaning
Reporter: sivabalan narayanan
We have two major cleaner policies: LATEST_COMMITS and LATEST_FILE_VERSIONS.
Of these, LATEST_COMMITS can be more
[efficient|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177],
because we maintain the earliest retained commit and only read the commit
metadata of the commits after it to find the partitions that might be
eligible for cleaning.
With LATEST_FILE_VERSIONS, we can't do any such optimization, so we always end
up doing a
[full listing|https://github.com/apache/hudi/blob/570989dc4011d5370f60991d8782408dd0f72c07/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L207].
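To make the difference between the two partition-discovery paths concrete, here is a minimal sketch. It is not Hudi code: the `Commit` class, both function names, and the sample partition paths are hypothetical stand-ins, and the timeline is assumed to be a plain list of commits that each record the partitions they wrote to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for a commit plus its commit metadata."""
    instant: str            # commit time; assumed to sort correctly as a string
    partitions: frozenset   # partitions this commit wrote to


def partitions_to_check_incremental(timeline, earliest_retained):
    """Collect only partitions written by commits after the earliest retained one."""
    touched = set()
    for commit in timeline:
        if commit.instant > earliest_retained:
            touched |= commit.partitions
    return sorted(touched)


def partitions_to_check_full(all_partitions):
    """Full listing: every partition in the table has to be inspected."""
    return sorted(all_partitions)


# Illustrative data only.
timeline = [
    Commit("001", frozenset({"2022/08/01"})),
    Commit("002", frozenset({"2022/08/02"})),
    Commit("003", frozenset({"2022/08/02", "2022/08/03"})),
]
all_partitions = {
    "2022/07/30", "2022/07/31", "2022/08/01", "2022/08/02", "2022/08/03",
}

incremental = partitions_to_check_incremental(timeline, earliest_retained="001")
full = partitions_to_check_full(all_partitions)
```

With this sample data the incremental path inspects two partitions, while the full listing inspects all five. The gap grows with the number of partitions in the table.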
As you can imagine, for larger tables with a huge number of partitions, this
might hurt performance. So I'm wondering whether we can introduce a hybrid
cleaner policy that combines both.
For example, we would clean based on LATEST_FILE_VERSIONS until the Nth commit
(say, 10), and on every 10th commit we would trigger the cleaner based on
LATEST_COMMITS. That way, from the 11th commit through the 20th, we can poll
the commits after the 10th to get the list of partitions to clean instead of
doing a full listing.
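One possible reading of this schedule is sketched below. It is an interpretation of the proposal, not an existing Hudi API: `plan_clean`, its parameters, and the returned tuple are all hypothetical, and commits are assumed to be numbered from 1.

```python
def plan_clean(commit_number, n=10):
    """Return (policy, poll_after) for the clean triggered at a given commit.

    poll_after is the commit whose successors are polled for partitions to
    clean; None means no such commit exists yet, so a full listing is needed.
    """
    if commit_number < 1 or n < 1:
        raise ValueError("commit_number and n must be positive")

    # Every Nth commit runs the commit-based policy; the rest run file-versions.
    if commit_number % n == 0:
        policy = "LATEST_COMMITS"
    else:
        policy = "LATEST_FILE_VERSIONS"

    # Most recent multiple of n strictly before this commit (0 if none yet).
    last_checkpoint = ((commit_number - 1) // n) * n
    poll_after = last_checkpoint if last_checkpoint > 0 else None
    return policy, poll_after


schedule = {c: plan_clean(c) for c in (5, 10, 11, 20, 21)}
```

Under these assumptions, commits 1 through 10 still need a full listing, commits 11 through 20 poll the commits after 10, and commits 21 through 30 poll the commits after 20.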
--
This message was sent by Atlassian Jira
(v8.20.10#820010)