parisni opened a new pull request, #6498:
URL: https://github.com/apache/hudi/pull/6498

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   Currently, incremental cleaning runs for both the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS policies. It does not run for KEEP_LATEST_FILE_VERSIONS.
   
   This can leave files uncleaned. This PR fixes the problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
   
   Here is a scenario that shows the problem:
   
   Say we have 3 committed files in partition-A, we add a new commit in partition-B, and we trigger cleaning for the first time:
   ```
   partition-A/
   commit-0.parquet
   commit-1.parquet
   commit-2.parquet
   partition-B/
   commit-3.parquet
   ```
   Say we chose KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner will remove commit-0.parquet to keep 3 commits.
   On the next cleaning, incremental cleaning kicks in and won't consider partition-A/ until a new commit changes it. If no later commit changes partition-A, then commit-1.parquet will stay forever, even after later commits to other partitions push it out of the 3-commit retention window. However, it should be removed by the cleaner.
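
To make the scenario concrete, here is a minimal Python sketch. It is a toy model of the behaviour described above, not Hudi's cleaner: `clean_keep_latest_commits` and `run_scenario` are hypothetical helpers, each commit is assumed to write exactly one file, and "incremental" is simplified to "only scan partitions written since the previous clean".

```python
def clean_keep_latest_commits(files, commits, retained, partitions):
    """Toy KEEP_LATEST_COMMITS pass: in the listed partitions, delete files
    written by commits older than the `retained` newest commits."""
    keep_from = sorted(commits)[-retained:][0]  # oldest commit still retained
    removed = []
    for p in partitions:
        removed += [(p, c) for c in files[p] if c < keep_from]
        files[p] = [c for c in files[p] if c >= keep_from]
    return removed


def run_scenario(incremental):
    # partition -> commit numbers, one parquet file per commit (assumption)
    files = {"partition-A": [0, 1, 2], "partition-B": [3]}
    commits = [0, 1, 2, 3]

    # First clean: no previous clean exists, so every partition is scanned.
    first = clean_keep_latest_commits(files, commits, 3, list(files))

    # A later commit touches partition-B only.
    files["partition-B"].append(4)
    commits.append(4)

    # Second clean: the simplified incremental mode only scans partitions
    # written since the first clean; a full clean scans all of them.
    scanned = ["partition-B"] if incremental else list(files)
    second = clean_keep_latest_commits(files, commits, 3, scanned)
    return first, second, files


inc_first, inc_second, inc_files = run_scenario(incremental=True)
full_first, full_second, full_files = run_scenario(incremental=False)
print(inc_files["partition-A"])   # commit-1 is still present
print(full_files["partition-A"])  # commit-1 has been removed
```

In this model the incremental run never revisits partition-A, so commit-1 survives even though it has fallen out of the 3-commit window, while the full scan removes it.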
   
   With KEEP_LATEST_FILE_VERSIONS, on the other hand, the cleaner keeps only commit-2.parquet in partition-A. It then makes sense for incremental cleaning to skip partition-A until it changes, because only one commit is left there and there is nothing more to clean.
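
The same kind of toy model can illustrate why skipping unchanged partitions looks safe for this policy. Again, this is a sketch under assumptions rather than Hudi's implementation: `clean_keep_latest_file_versions` is a hypothetical helper, each partition is treated as a single file group, and one version is retained.

```python
import copy


def clean_keep_latest_file_versions(files, versions_retained, partitions):
    """Toy KEEP_LATEST_FILE_VERSIONS pass: in the listed partitions, keep only
    the newest `versions_retained` files (assumed to be >= 1)."""
    removed = []
    for p in partitions:
        versions = sorted(files[p])
        stale = versions[:-versions_retained]
        removed += [(p, c) for c in stale]
        files[p] = versions[-versions_retained:]
    return removed


# Same starting layout as the scenario above.
files = {"partition-A": [0, 1, 2], "partition-B": [3]}

# First clean scans everything and leaves one file per partition.
first = clean_keep_latest_file_versions(files, 1, list(files))

# A later commit touches partition-B only.
files["partition-B"].append(4)

# Compare a clean that only scans the changed partition with a full scan.
incremental_files = copy.deepcopy(files)
full_files = copy.deepcopy(files)
incremental = clean_keep_latest_file_versions(incremental_files, 1, ["partition-B"])
full = clean_keep_latest_file_versions(full_files, 1, list(full_files))
print(incremental == full)
```

Here both runs remove the same files, because a partition that has not changed since its last clean already holds only the retained versions.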
   
   This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.
   
   Hope this is clear enough.
   
   
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

