parisni commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1237390122

   Well, after reading that method again:
   
https://github.com/apache/hudi/blob/ca8a57a21d163e573e3a617fd6173fa4b913666c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177
   it does exactly what the logger says:
   ```
       LOG.info("Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed "
           + "since last cleaned at " + cleanMetadata.getEarliestCommitToRetain()
           + ". New Instant to retain : " + newInstantToRetain);
   ```
   
   But you were right: newInstantToRetain is `commit2`. In my mind it was
   the cleaning commit, which would have been `commit4`.
   
   > this may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we
   > can't pinpoint a commit and say everything before that commit can
   > be ignored for future cleaning. Thus, in the case of
   > KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning
   
   If we pin the cleaning commit (`commit4`), then we can apply incremental
   cleaning together with `KEEP_LATEST_FILE_VERSIONS`.
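
To make the difference between the two pinning choices concrete, here is a minimal sketch. It is not Hudi's actual code: `IncrementalCleanSketch`, `partitionsToScan` and `exampleTimeline` are hypothetical names, the timeline is the one from the example quoted below, and comparing instants as plain strings is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IncrementalCleanSketch {

    // Hypothetical helper (not a Hudi API): collects the partitions touched by
    // every commit whose instant is >= sinceInstant. Instants are compared as
    // plain strings, which is an assumption made for this sketch.
    public static Set<String> partitionsToScan(Map<String, String> commitToPartition,
                                               String sinceInstant) {
        Set<String> partitions = new TreeSet<>();
        for (Map.Entry<String, String> entry : commitToPartition.entrySet()) {
            if (entry.getKey().compareTo(sinceInstant) >= 0) {
                partitions.add(entry.getValue());
            }
        }
        return partitions;
    }

    // The four commits from the example in this thread: instant -> partition written.
    public static Map<String, String> exampleTimeline() {
        Map<String, String> timeline = new LinkedHashMap<>();
        timeline.put("commit0", "partition-A");
        timeline.put("commit1", "partition-A");
        timeline.put("commit2", "partition-A");
        timeline.put("commit3", "partition-B");
        return timeline;
    }

    public static void main(String[] args) {
        Map<String, String> timeline = exampleTimeline();
        // Pinned at the earliest commit retained (commit2).
        System.out.println(partitionsToScan(timeline, "commit2")); // [partition-A, partition-B]
        // Pinned at the cleaning commit itself (commit4 in my reading).
        System.out.println(partitionsToScan(timeline, "commit4")); // []
    }
}
```

With the pin at `commit2`, the next clean revisits both partitions; with the pin at `commit4`, it would only look at partitions written after the last clean ran.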
   
   On Mon, 2022-09-05 at 08:32 +0000, Nicolas Paris wrote:
   > > For the next cleaning, incremental cleaning will trigger, and will
   > > comb through all commits >= earliest commit retained i.e. commit2.
   > > and so file2_v1 will be deleted this time. and will update
   > > the earliest commit retained to commit3 now.
   > 
   > I assume you meant file1_v2?
   > 
   > Let me read the source again; that's not what I understood or
   > tested so far.
   > 
   > On September 4, 2022 1:30:03 AM UTC, Sivabalan Narayanan
   > ***@***.***> wrote:
   > > got it, thanks. 
   > > Let me go through the example you have put up and clarify few
   > > things.
   > > 
   > > 
   > > Say we have 3 committed files in partition-A and we add a new
   > > commit in partition-B, and we trigger cleaning for the first time
   > > (full partition scan):
   > > 
   > > ```
   > > partition-A/
   > > commit-0 added file1_V1.parquet
   > > commit-1 added file1_V2.parquet
   > > commit-2 added file1_V3.parquet
   > > partition-B/
   > > commit-3 added file2_V1.parquet
   > > ```
   > > 
   > > In the case say we have KEEP_LATEST_COMMITS with
   > > CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files
   > > created by commit-0 and keep 3 commits. ie. file1_V1.parquet will
   > > be cleaned up. But hudi also keeps track of `earliest commit
   > > retained` in this case which is commit2. This `earliest commit
   > > retained` is the one we will leverage later to do incremental
   > > cleaning. 
   > > 
   > > For the next cleaning, incremental cleaning will trigger, and will
   > > comb through all commits >= `earliest commit retained` i.e.
   > > commit2. and so file2_v1 will be deleted this time. and will update
   > > the `earliest commit retained` to commit3 now. 
   > > 
   > > This may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we
   > > can't pinpoint a commit and say everything before that commit can
   > > be ignored for future cleaning. Thus, in the case of
   > > KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is
   > > in this policy, we might encounter a corner case where, a file
   > > group was updated only in Commit 1 and commit and never updated
   > > later. and after a long time, had a new version in say commit 100.
   > > we need to clean up the first version (assuming
   > > KEEP_LATEST_FILE_VERSIONS count is 2). 
   > > 
   > > Let me know if this makes sense. or if you still feel, I am missing
   > > something, can you elaborate w/ an example. 
   

