nsivabalan commented on PR #6498: URL: https://github.com/apache/hudi/pull/6498#issuecomment-1248629629
here is my take: don't think we can go w/ this fix as is. but we can try something else. we need to maintain another variable called, `lastCompletedCommit` whenever a cleaner kicks in. so, it will be added as part of clean commit metadata. infact we landed a patch 2 days back that adds this. https://github.com/apache/hudi/pull/5478 let me walk through a scenario. cleaner policy: based on latest file versions. count: 2 Commit1 : FG1_V1, FG2_V2 Commit2: FG1_V2, FG3_V1 Commit3: FG1_V3, FG3_V2, FG4_V1 Commit4: FG1_V4, FG2_V2, FG4_V2, FG5_V1 Commit5: FG2_V3, FG5_V2 for first two commits, nothing will get cleaned up. and so "lastCompletedCommit" = null. just after C3, lets say we trigger cleaning. C3_C FG_1_V1 will be cleaned up. and will mark `lastCompletedCommit` = C3. just after C4, we trigger cleaner again. we can just look at file groups that got newer versions after C3 until now. in this case, just in C4. So, while planning we can consider only file groups FG1, FG2, FG4 and FG5. reasoning is, those file groups which was never touched after last clean may not reach the max file version threshold. bcoz, they did not receive any updates only. So, with C4_Clean, we will consider only FG1, FG2, FG4 and FG5 and eventually will clean up FG1_V2. And update `lastCompletedCommit` = C4. just after C5, we trigger cleaner again. will consider only file groups updated after C4 (lastCompletedCommit from last clean). which is FG2, FG5. and will result in cleaning up FG2_V1. since FG1, FG3 nor FG4 was touched, we don't need to even consider them while preparing the clean plan. the current patch in its current state might miss out something. we can't rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is being set to C2. but ideally we should consider lastCompletedCommit from the last time cleaner got triggered and do incremental polling for newer commits from there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
