nsivabalan commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1248629629

   here is my take:
   don't think we can go w/ this fix as is. but we can try something else. 
   
   we need to maintain another variable called, `lastCompletedCommit` whenever 
a cleaner kicks in. so, it will be added as part of clean commit metadata. 
infact we landed a patch 2 days back that adds this. 
https://github.com/apache/hudi/pull/5478
   
   let me walk through a scenario. 
   cleaner policy: based on latest file versions.
   count: 2 
   
   Commit1 :
   FG1_V1, FG2_V2
   
   Commit2: 
   FG1_V2, FG3_V1
   
   Commit3: FG1_V3, FG3_V2, FG4_V1
   
   Commit4: FG1_V4, FG2_V2, FG4_V2, FG5_V1
   
   Commit5: FG2_V3, FG5_V2
   
   for first two commits, nothing will get cleaned up. and so 
"lastCompletedCommit" = null. 
   
   just after C3, lets say we trigger cleaning. 
   C3_C
   FG_1_V1 will be cleaned up. 
   and will mark `lastCompletedCommit` = C3. 
   
   just after C4, we trigger cleaner again. 
   we can just look at file groups that got newer versions after C3 until now. 
in this case, just in C4.
   So, while planning we can consider only file groups FG1, FG2, FG4 and FG5. 
reasoning is, those file groups which was never touched after last clean may 
not reach the max file version threshold. bcoz, they did not receive any 
updates only. 
   
   So, with C4_Clean, we will consider only FG1, FG2, FG4 and FG5 and 
eventually will clean up FG1_V2. 
   
   And update `lastCompletedCommit` = C4. 
   
   just after C5, we trigger cleaner again. 
   will consider only file groups updated after C4 (lastCompletedCommit from 
last clean). which is FG2, FG5. 
   and will result in cleaning up FG2_V1. since FG1, FG3 nor FG4 was touched, 
we don't need to even consider them while preparing the clean plan. 
    
   
   the current patch in its current state might miss out something. we can't 
rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is 
being set to C2. but ideally we should consider lastCompletedCommit from the 
last time cleaner got triggered and do incremental polling for newer commits 
from there. 
   
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to