nsivabalan commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1236228935

   Got it, thanks.
   Let me go through the example you put up and clarify a few things.
   
   
   Say we have 3 committed file versions in partition-A, we add a new commit in 
partition-B, and we trigger cleaning for the first time (full partition scan):
   
   ```
   partition-A/
   commit-0 added file1_V1.parquet
   commit-1 added file1_V2.parquet
   commit-2 added file1_V3.parquet
   partition-B/
   commit-3 added file2_V1.parquet
   ```
   
   Say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner 
will remove the files created by commit-0 and keep 3 commits, i.e. 
file1_V1.parquet will be cleaned up. Hudi also keeps track of the `earliest 
commit retained`, which in this case is commit-2. This `earliest commit 
retained` is what we will leverage later to do incremental cleaning. 
   
   For the next cleaning, incremental cleaning will kick in and comb through 
all commits >= `earliest commit retained`, i.e. commit-2, so file2_V1 will be 
deleted this time, and the `earliest commit retained` will be updated to 
commit-3. 
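
   To make the "comb through all commits >= `earliest commit retained`" step 
concrete, here is a rough sketch. It is not Hudi's actual code: the `TIMELINE` 
dict, the integer commit ids and the `partitions_to_scan` helper are all made 
up for illustration, and it only models which partitions an incremental clean 
would need to look at.

```python
from typing import Dict, List, Set

# Hypothetical model of the timeline from the example above:
# commit id -> partitions that commit wrote to.
# Commit ids are plain integers here so that ordering is obvious.
TIMELINE: Dict[int, List[str]] = {
    0: ["partition-A"],
    1: ["partition-A"],
    2: ["partition-A"],
    3: ["partition-B"],
}


def partitions_to_scan(
    timeline: Dict[int, List[str]], earliest_commit_retained: int
) -> Set[str]:
    """Collect partitions touched by commits >= earliest_commit_retained.

    Everything older than the recorded commit is skipped, which is what
    lets an incremental clean avoid a full partition scan.
    """
    return {
        partition
        for commit, partitions in timeline.items()
        if commit >= earliest_commit_retained
        for partition in partitions
    }


# Previous clean recorded commit-2 as the earliest commit retained.
second_clean = partitions_to_scan(TIMELINE, earliest_commit_retained=2)

# Once the marker has moved to commit-3, partition-A no longer needs a scan.
third_clean = partitions_to_scan(TIMELINE, earliest_commit_retained=3)
```

   In this toy model the second clean looks at both partitions, and a later 
clean starting from commit-3 only looks at partition-B.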
   
   This may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we can't 
pinpoint a commit and say everything before that commit can be ignored for 
future cleaning. Thus, in the case of KEEP_LATEST_FILE_VERSIONS, we can't do 
incremental cleaning. It is with this policy that we might encounter a corner 
case where a file group was updated only in commit 1 and one other early 
commit, was never updated afterwards, and then after a long time got a new 
version in, say, commit 100. At that point we need to clean up the first 
version (assuming the KEEP_LATEST_FILE_VERSIONS count is 2). 
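
   The corner case can be sketched like this. Again, this is only an 
illustration and not Hudi's implementation: `versions_to_clean` and the 
`file_groups` dict are hypothetical, commit ids are plain integers, and the 
second early commit is assumed to be commit 2 just to have a concrete number.

```python
from typing import Dict, List


def versions_to_clean(
    file_groups: Dict[str, List[int]], versions_retained: int
) -> Dict[str, List[int]]:
    """Return, per file group, the commits whose file versions are stale.

    A version is stale when it is not among the newest `versions_retained`
    versions of its own file group. The decision is per file group, so no
    single commit acts as a global cutoff.
    """
    stale: Dict[str, List[int]] = {}
    for file_group, commits in file_groups.items():
        newest_first = sorted(commits, reverse=True)
        old_versions = newest_first[versions_retained:]
        if old_versions:
            stale[file_group] = sorted(old_versions)
    return stale


# file1 was written in commit 1 and (assumed) commit 2, then not touched
# again until commit 100. file2 only has two recent versions.
file_groups = {
    "file1": [1, 2, 100],
    "file2": [98, 99],
}

to_clean = versions_to_clean(file_groups, versions_retained=2)
```

   Here the only stale version is the one file1 got in commit 1, which can be 
arbitrarily far behind the recent commits, while file2 has nothing to clean 
even though both of its versions are recent.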
   
   Let me know if this makes sense. If you still feel I am missing something, 
can you elaborate with an example? 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
