parisni commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1236697613

   > For the next cleaning, incremental cleaning will trigger and comb 
through all commits >= earliest commit retained, i.e. commit2, so file2_v1 
will be deleted this time, and the earliest commit retained will be updated to 
commit3.
   
   I assume you meant file1_v2?
   
   Let me read the source again; that is not what I understood, nor what I 
have tested so far.
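
A minimal sketch of the incremental step described in the quoted message below, under simplifying assumptions: commits are plain integers, each commit touches exactly one partition, and `TIMELINE` and `partitions_to_scan` are hypothetical names for illustration, not Hudi APIs. It only shows which partitions a cleaner would need to revisit given a recorded `earliest commit retained`; it does not decide which file versions get deleted.

```python
from typing import Dict, Optional, Set

# Toy timeline taken from the quoted example: commit number -> partition written.
TIMELINE: Dict[int, str] = {
    0: "partition-A",
    1: "partition-A",
    2: "partition-A",
    3: "partition-B",
}


def partitions_to_scan(
    timeline: Dict[int, str], earliest_retained: Optional[int]
) -> Set[str]:
    """Return the partitions a cleaning run has to look at.

    earliest_retained is the value recorded by the previous clean.
    None models the very first clean, which falls back to a full scan.
    """
    if earliest_retained is None:
        # First clean: nothing recorded yet, so every partition is scanned.
        return set(timeline.values())
    # Incremental clean: only partitions touched by commits at or after
    # the recorded earliest commit retained are considered.
    return {
        partition
        for commit, partition in timeline.items()
        if commit >= earliest_retained
    }
```

With the example timeline, a recorded value of commit 2 still requires revisiting both partitions, while a recorded value of commit 3 narrows the scan to partition-B only.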
   
   On September 4, 2022 1:30:03 AM UTC, Sivabalan Narayanan ***@***.***> wrote:
   >got it, thanks. 
   >Let me go through the example you have put up and clarify a few things.
   >
   >
   >Say we have 3 committed files in partition-A and we add a new commit in 
partition-B, and we trigger cleaning for the first time (full partition scan):
   >
   >```
   >partition-A/
   >commit-0 added file1_V1.parquet
   >commit-1 added file1_V2.parquet
   >commit-2 added file1_V3.parquet
   >partition-B/
   >commit-3 added file2_V1.parquet
   >```
   >
   >Say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3: the 
cleaner will remove the files created by commit-0 and keep 3 commits, i.e. 
file1_V1.parquet will be cleaned up. Hudi also keeps track of the 
`earliest commit retained`, which in this case is commit2. This 
`earliest commit retained` is what we leverage later to do incremental 
cleaning.
   >
   >For the next cleaning, incremental cleaning will trigger and comb 
through all commits >= `earliest commit retained`, i.e. commit2, so file2_v1 
will be deleted this time, and the `earliest commit retained` will be updated 
to commit3.
   >
   >This may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we 
can't pinpoint a commit and say everything before that commit can be ignored 
for future cleaning. Thus, in the case of KEEP_LATEST_FILE_VERSIONS, we can't 
do incremental cleaning. It is with this policy that we might encounter a 
corner case where a file group was updated only in commit 1 and one other 
early commit, was not updated for a long time, and then got a new version in, 
say, commit 100. We then need to clean up the first version (assuming the 
KEEP_LATEST_FILE_VERSIONS count is 2).
   >
   >Let me know if this makes sense. If you still feel I am missing 
something, can you elaborate with an example?
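
The KEEP_LATEST_FILE_VERSIONS corner case above can be sketched as follows. This is an illustration under stated assumptions, not Hudi's implementation: `versions_to_clean` is a hypothetical helper, each file version is identified only by the commit number that wrote it, and the second early commit is not numbered in the message, so commit 2 here is an assumed value.

```python
from typing import List


def versions_to_clean(version_commits: List[int], versions_retained: int) -> List[int]:
    """Return the commits whose file versions fall outside the retained window.

    version_commits lists the commits that wrote a version of one file group.
    The newest versions_retained versions are kept; all older ones are returned.
    """
    if versions_retained <= 0:
        raise ValueError("versions_retained must be positive")
    ordered = sorted(version_commits)
    # Everything except the newest `versions_retained` entries is cleanable.
    return ordered[:-versions_retained]


# File group written in commit 1 and (assumed) commit 2, then again in commit 100.
before_commit_100 = versions_to_clean([1, 2], versions_retained=2)
after_commit_100 = versions_to_clean([1, 2, 100], versions_retained=2)
```

In this toy model the version from commit 1 only becomes cleanable once commit 100 arrives, so the decision depends on the per-file-group version count rather than on a single commit cutoff.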

