hudi-bot opened a new issue, #15442:
URL: https://github.com/apache/hudi/issues/15442
Cleaning based on KEEP_LATEST_FILE_VERSIONS can be improved further, since
incremental clean is not enabled for it. Let's see if we can improve this.
Context from the author:
Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and
KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS.
Running incremental cleaning with the commit-based policies can leave files
uncleaned. This PR fixes the problem by enabling incremental cleaning for
KEEP_LATEST_FILE_VERSIONS only.
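For orientation, the two setups being compared could be written as writer options roughly like this. This is a sketch only: the `hoodie.cleaner.*` key names are given as I understand them and should be checked against the configuration reference for your Hudi release, and the retained-versions value of 1 is an assumption taken from the scenario in this issue.

```python
# Assumed Hudi cleaner options (verify key names against your Hudi version).

# Commit-based policy, retaining the last 3 commits.
keep_latest_commits_opts = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "3",
    "hoodie.cleaner.incremental.mode": "true",
}

# Version-based policy, retaining 1 file version (value assumed for illustration).
keep_latest_file_versions_opts = {
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "1",
    "hoodie.cleaner.incremental.mode": "true",
}
```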
Here is a scenario that shows the problem:
Say we have 3 committed files in partition-A, we add a new commit in
partition-B, and we trigger cleaning for the first time (full partition scan):
{{partition-A/
  commit-0.parquet
  commit-1.parquet
  commit-2.parquet
partition-B/
  commit-3.parquet}}
Say we use KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner
will remove commit-0.parquet to keep 3 commits.
On the next cleaning, incremental cleaning will trigger, and it won't consider
partition-A/ until a new commit changes it. If no later commit changes
partition-A, then commit-1.parquet will stay forever, even though it drops out
of the retained commits as new commits land elsewhere and should be removed by
the cleaner.
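The KEEP_LATEST_COMMITS case can be reproduced with a toy model. This is not Hudi's cleaner: `clean_keep_latest_commits` is a hypothetical helper that follows the simplified rule used in this issue (delete files written by commits outside the last N), and the follow-up commit 4 in partition-B is an assumption added to advance the timeline.

```python
def clean_keep_latest_commits(partitions, timeline, retained, only_partitions=None):
    """Toy cleaner: delete files written by commits outside the last `retained` commits.

    `only_partitions` mimics incremental cleaning by restricting the scan.
    """
    keep = set(timeline[-retained:])
    targets = list(partitions) if only_partitions is None else only_partitions
    deleted = []
    for name in targets:
        for commit in list(partitions[name]):  # iterate over a copy while removing
            if commit not in keep:
                partitions[name].remove(commit)
                deleted.append(f"{name}/commit-{commit}.parquet")
    return deleted


partitions = {"partition-A": [0, 1, 2], "partition-B": [3]}
timeline = [0, 1, 2, 3]

# First clean: full partition scan, commits 1..3 are retained.
first = clean_keep_latest_commits(partitions, timeline, retained=3)

# Assumed follow-up: commit 4 writes only to partition-B.
partitions["partition-B"].append(4)
timeline.append(4)

# Incremental clean only scans partitions touched since the last clean.
incremental = clean_keep_latest_commits(
    partitions, timeline, retained=3, only_partitions=["partition-B"]
)
leftover = list(partitions["partition-A"])

# A full scan at the same point does find the stale file.
full = clean_keep_latest_commits(partitions, timeline, retained=3)

print(first, incremental, leftover, full)
```

In this model the incremental pass deletes nothing and leaves commit-1.parquet in partition-A, while a full scan at the same point removes it.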
With KEEP_LATEST_FILE_VERSIONS, on the other hand, the cleaner will keep only
commit-2.parquet. It then makes sense that incremental cleaning won't consider
partition-A until it is changed, because only one commit is left there.
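The same toy setup shows why the version-based policy is safe to run incrementally. Again a sketch, not Hudi's implementation: `clean_keep_latest_file_versions` is a hypothetical helper, each partition is assumed to hold versions of a single file group, one version is retained, and commit 4 in partition-B is an assumed follow-up write.

```python
def clean_keep_latest_file_versions(partitions, versions=1, only_partitions=None):
    """Toy cleaner: keep the newest `versions` files per partition (versions >= 1).

    `only_partitions` mimics incremental cleaning by restricting the scan.
    """
    targets = list(partitions) if only_partitions is None else only_partitions
    deleted = []
    for name in targets:
        files = sorted(partitions[name])
        stale = files[:-versions]
        partitions[name] = files[-versions:]
        deleted.extend(f"{name}/commit-{c}.parquet" for c in stale)
    return deleted


partitions = {"partition-A": [0, 1, 2], "partition-B": [3]}

# First clean: full partition scan, only the newest version per partition survives.
first = clean_keep_latest_file_versions(partitions)

# Assumed follow-up: commit 4 writes only to partition-B.
partitions["partition-B"].append(4)

# Incremental clean only scans the partition touched since the last clean.
incremental = clean_keep_latest_file_versions(
    partitions, only_partitions=["partition-B"]
)

# A full scan afterwards has nothing left to delete.
full = clean_keep_latest_file_versions(partitions)

print(first, incremental, full)
```

Here an untouched partition can never accumulate stale files, so skipping it loses nothing: the full scan after the incremental pass returns an empty list.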
This is why incremental cleaning should only be enabled with
KEEP_LATEST_FILE_VERSIONS.
Hope this is clear enough.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-4878
- Type: Improvement