[
https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-4878:
--------------------------------------
Sprint: 2022/09/05, 2022/10/04 (was: 2022/09/05)
> Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
> ----------------------------------------------------------------
>
> Key: HUDI-4878
> URL: https://issues.apache.org/jira/browse/HUDI-4878
> Project: Apache Hudi
> Issue Type: Improvement
> Components: cleaning
> Reporter: sivabalan narayanan
> Assignee: nicolas paris
> Priority: Blocker
> Labels: pull-request-available
>
> Cleaning based on KEEP_LATEST_FILE_VERSIONS can be improved further, since
> incremental cleaning is not enabled for it. Let's see if we can improve this.
>
> Context from the author:
>
>
> Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and
> KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS.
> This can lead to files not being cleaned. This PR fixes the problem by enabling
> incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
> Here is a scenario that shows the problem:
> Say we have 3 committed files in partition-A, we add a new commit in
> partition-B, and we trigger cleaning for the first time (full partition scan):
> {{partition-A/
>   commit-0.parquet
>   commit-1.parquet
>   commit-2.parquet
> partition-B/
>   commit-3.parquet}}
> Say we use KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3: the cleaner
> will remove commit-0.parquet to keep 3 commits.
> On the next cleaning, incremental cleaning kicks in and won't consider
> partition-A/ until a new commit changes it. If no later commit changes
> partition-A, commit-1.parquet will stay forever, even though the cleaner
> should remove it once newer commits push it out of the retained window.
> With KEEP_LATEST_FILE_VERSIONS, on the other hand, the cleaner will only keep
> commit-2.parquet. It then makes sense that incremental cleaning won't
> consider partition-A until it is changed, because only one commit is left there.
> This is why incremental cleaning should only be enabled with
> KEEP_LATEST_FILE_VERSIONS.
> Hope this is clear enough.
>
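
The scenario above can be replayed with a small toy model. This is only a sketch of the reasoning in the description, not Hudi's actual clean planner: the names `Table`, `clean` and `run` are hypothetical, each partition is assumed to hold a single file group, KEEP_LATEST_FILE_VERSIONS is assumed to retain one version, and a later commit 4 in partition-B is assumed so that a second cleaning has something to do.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Table:
    """Toy table: partition name -> commit numbers that wrote a file there."""
    partitions: Dict[str, List[int]] = field(default_factory=dict)
    commits: List[int] = field(default_factory=list)
    last_clean_commit: Optional[int] = None  # None means no clean has run yet

    def write(self, partition: str, commit: int) -> None:
        self.partitions.setdefault(partition, []).append(commit)
        self.commits.append(commit)

    def files(self) -> List[str]:
        return sorted(
            f"{p}/commit-{c}.parquet"
            for p, cs in self.partitions.items()
            for c in cs
        )


def clean(table: Table, policy: str, retained: int, incremental: bool) -> List[str]:
    """Delete old files; return the deleted file names (simplified model)."""
    if incremental and table.last_clean_commit is not None:
        # Incremental mode: only look at partitions written since the last clean.
        candidates = [
            p for p, cs in table.partitions.items()
            if any(c > table.last_clean_commit for c in cs)
        ]
    else:
        # First clean, or incremental mode disabled: full partition scan.
        candidates = list(table.partitions)

    deleted = []
    for p in candidates:
        versions = sorted(table.partitions[p])
        if policy == "KEEP_LATEST_COMMITS":
            # Keep files from the last `retained` commits, plus the newest file.
            earliest = sorted(table.commits)[-retained:][0]
            keep = [c for c in versions if c >= earliest or c == versions[-1]]
        else:  # KEEP_LATEST_FILE_VERSIONS
            keep = versions[-retained:]
        deleted += [f"{p}/commit-{c}.parquet" for c in versions if c not in keep]
        table.partitions[p] = keep

    table.last_clean_commit = max(table.commits)
    return deleted


def run(policy: str, retained: int, incremental: bool) -> List[str]:
    """Replay the scenario, then one more commit in partition-B and a second clean."""
    table = Table()
    for commit in (0, 1, 2):
        table.write("partition-A", commit)
    table.write("partition-B", 3)
    clean(table, policy, retained, incremental)  # first clean is always a full scan
    table.write("partition-B", 4)                # assumed later commit, partition-A untouched
    clean(table, policy, retained, incremental)
    return table.files()


if __name__ == "__main__":
    for policy, retained in (("KEEP_LATEST_COMMITS", 3), ("KEEP_LATEST_FILE_VERSIONS", 1)):
        for incremental in (True, False):
            print(policy, "incremental" if incremental else "full scan",
                  run(policy, retained, incremental))
```

Under these assumptions, the incremental run with KEEP_LATEST_COMMITS leaves partition-A/commit-1.parquet behind while the full scan removes it. With KEEP_LATEST_FILE_VERSIONS the incremental and full-scan runs end with the same files, which is the point the author makes.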