parisni opened a new pull request, #6498: URL: https://github.com/apache/hudi/pull/6498
### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS. This can leave files uncleaned that should have been removed. This PR fixes the problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.

Here is the scenario of the problem. Say we have 3 committed files in partition-A, we add a new commit in partition-B, and we trigger cleaning for the first time:

```
partition-A/
  commit-0.parquet
  commit-1.parquet
  commit-2.parquet
partition-B/
  commit-3.parquet
```

Say we chose KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner removes commit-0.parquet to keep 3 commits. On the next cleaning, incremental cleaning triggers and does not consider partition-A until a new commit changes it. If no later commit changes partition-A, commit-1.parquet stays forever, even after newer commits elsewhere push it out of the 3-commit window. The cleaner should remove it.

With KEEP_LATEST_FILE_VERSIONS, the cleaner keeps only commit-2.parquet. It then makes sense that incremental cleaning does not consider partition-A until it changes, because only one commit is left there.

This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.

Hope this is clear enough.

**Risk level: none | low | medium | high**

_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
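
The scenario described under Impact can be reproduced with a small toy model. This is only a sketch and not Hudi's cleaner: `clean`, `run_scenario`, and the table layout are hypothetical names, each file is identified by its commit number, the latest file in a partition is assumed to be always kept, and KEEP_LATEST_FILE_VERSIONS is assumed to retain a single version. Incremental cleaning is modeled as "scan only the partitions touched since the previous clean".

```python
from typing import Dict, List

# Toy table: partition name -> commit ids of the files it holds (hypothetical layout).
Table = Dict[str, List[int]]


def clean(table: Table, commits: List[int], policy: str,
          retained: int, partitions: List[str]) -> None:
    """Apply a simplified cleaning policy to the given partitions only."""
    if retained < 1:
        raise ValueError("retained must be >= 1")
    # Commits inside the retention window (used by KEEP_LATEST_COMMITS).
    window = set(sorted(commits)[-retained:])
    for name in partitions:
        files = sorted(table[name])
        if policy == "KEEP_LATEST_COMMITS":
            # Keep files written by retained commits, plus the latest file (assumption).
            table[name] = [c for c in files if c in window or c == files[-1]]
        elif policy == "KEEP_LATEST_FILE_VERSIONS":
            # Keep the newest `retained` files of the partition.
            table[name] = files[-retained:]
        else:
            raise ValueError(f"unknown policy: {policy}")


def run_scenario(policy: str, retained: int, incremental: bool) -> Table:
    """Replay the PR scenario and return the table after the second clean."""
    table: Table = {"partition-A": [0, 1, 2], "partition-B": [3]}
    commits = [0, 1, 2, 3]
    # First clean: there is no previous clean, so every partition is scanned.
    clean(table, commits, policy, retained, list(table))
    # A later commit touches partition-B only.
    table["partition-B"].append(4)
    commits.append(4)
    touched = ["partition-B"]
    # Second clean: incremental mode scans only the touched partitions.
    clean(table, commits, policy, retained, touched if incremental else list(table))
    return table


if __name__ == "__main__":
    for policy, retained in [("KEEP_LATEST_COMMITS", 3), ("KEEP_LATEST_FILE_VERSIONS", 1)]:
        for incremental in (True, False):
            print(policy, "incremental" if incremental else "full scan",
                  run_scenario(policy, retained, incremental))
```

In this model, KEEP_LATEST_COMMITS with incremental scanning leaves commit-1 in partition-A while a full scan removes it. With KEEP_LATEST_FILE_VERSIONS the incremental and full-scan results are identical, which matches the argument above.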
