sivabalan narayanan created HUDI-4878:
-----------------------------------------

             Summary: Fix incremental cleaning for clean based on 
LATEST_FILE_VERSIONS
                 Key: HUDI-4878
                 URL: https://issues.apache.org/jira/browse/HUDI-4878
             Project: Apache Hudi
          Issue Type: Improvement
          Components: cleaning
            Reporter: sivabalan narayanan


Cleaning based on KEEP_LATEST_FILE_VERSIONS can be improved further, since 
incremental cleaning is not enabled for it. Let's see if we can improve this.

Context from author:

Currently, incremental cleaning runs for the KEEP_LATEST_COMMITS and 
KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS.

This can leave stale files uncleaned. This PR fixes the problem by enabling 
incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.

Here is a scenario that illustrates the problem:

Say we have 3 committed files in partition-A, we add a new commit in 
partition-B, and we then trigger cleaning for the first time (full partition 
scan):
 {{partition-A/
     commit-0.parquet
     commit-1.parquet
     commit-2.parquet
 partition-B/
     commit-3.parquet}}
Say we use KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner 
will remove commit-0.parquet to keep 3 commits.
On the next cleaning, incremental cleaning will trigger and won't consider 
partition-A/ until a new commit changes it. If no later commit changes 
partition-A, then commit-1.parquet will stay forever, even after newer commits 
elsewhere push it out of the retention window. However, it should be removed 
by the cleaner.
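
The scenario above can be sketched with a small toy model. This is only an 
illustration, not Hudi's cleaner code: the function name, the 
(partition, commit) tuples, and the simplified rule "keep files written by the 
last N commits" are all assumptions made for the example.

```python
# Toy model of the scenario: each file is (partition, commit_number).
# All names are hypothetical; this is not how Hudi implements cleaning.

def clean_keep_latest_commits(files, timeline, retained, scanned_partitions):
    """Drop files older than the last `retained` commits, but only in
    the partitions the cleaner actually scans."""
    earliest_retained = sorted(timeline)[-retained]
    return {
        (partition, commit)
        for (partition, commit) in files
        if partition not in scanned_partitions or commit >= earliest_retained
    }

all_partitions = {"partition-A", "partition-B"}
files = {("partition-A", 0), ("partition-A", 1),
         ("partition-A", 2), ("partition-B", 3)}
timeline = [0, 1, 2, 3]

# First clean: full partition scan removes commit-0.
files = clean_keep_latest_commits(files, timeline, 3, all_partitions)

# A later commit touches only partition-B.
files.add(("partition-B", 4))
timeline.append(4)

# Incremental clean scans only the partitions changed since the last clean.
incremental = clean_keep_latest_commits(files, timeline, 3, {"partition-B"})
# A full scan is shown for comparison.
full = clean_keep_latest_commits(files, timeline, 3, all_partitions)

# Files that the incremental clean leaves behind but a full scan would remove.
print(sorted(incremental - full))  # [('partition-A', 1)]
```

In this sketch, commit-1.parquet in partition-A survives the incremental clean 
only because partition-A is never scanned again.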

Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep 
commit-2.parquet in partition-A. It then makes sense that incremental cleaning 
won't consider partition-A until it is changed, because only one commit is 
left there.
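
A minimal sketch of this case, under stated assumptions: each partition is 
treated as a single file group, one file version is retained, and the function 
name and (partition, commit) tuples are hypothetical rather than taken from 
Hudi.

```python
from collections import defaultdict

# Toy model: each file is (partition, commit_number).
# All names are hypothetical; this is not how Hudi implements cleaning.

def clean_keep_latest_file_versions(files, versions_retained, scanned_partitions):
    """Keep the newest `versions_retained` files in each scanned partition;
    partitions that are not scanned are left untouched."""
    by_partition = defaultdict(list)
    for partition, commit in files:
        by_partition[partition].append(commit)
    kept = set()
    for partition, commits in by_partition.items():
        if partition in scanned_partitions:
            commits = sorted(commits)[-versions_retained:]
        kept.update((partition, commit) for commit in commits)
    return kept

all_partitions = {"partition-A", "partition-B"}
files = {("partition-A", 0), ("partition-A", 1),
         ("partition-A", 2), ("partition-B", 3)}

# First clean: full partition scan keeps only the latest file per partition.
files = clean_keep_latest_file_versions(files, 1, all_partitions)

# A later commit touches only partition-B.
files.add(("partition-B", 4))

# Incremental clean scans only the partitions changed since the last clean.
incremental = clean_keep_latest_file_versions(files, 1, {"partition-B"})
# A full scan is shown for comparison.
full = clean_keep_latest_file_versions(files, 1, all_partitions)

print(incremental == full)  # True
```

In this sketch, skipping the unchanged partition-A loses nothing, since it 
already holds only its latest file.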

This is why incremental cleaning should only be enabled with 
KEEP_LATEST_FILE_VERSIONS.

Hope this is clear enough.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
