Re: [I] [to be discussed] Configure clean on spark to gracefully handle a large increase in uncleaned files. [hudi]

via GitHub Wed, 11 Feb 2026 19:17:31 -0800


vinothchandar commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3888442695


   > In order to minimize chance of full table scan clean (if archival were to 
run before ECTR progressed) and avoid incremental clean repeatedly re-reading 
the same instants from last ECTR (which takes up time).
   
   My thinking here is to first microbenchmark this at a few rough scale 
points you can share, then decide. Anything we plan, I'd love to plan alongside 
the LSM timeline. It's not worth spending effort on other angles IMO; it 
stretches us thin. 
   
   > But for clean/ECTR, this doesn't fully address the issue (unlike empty 
clean), since we need to make sure the instant corresponding to the ECTR is 
getting "progressed" even if there is nothing to clean
   
   To your point, yes, there are still complexities. 
   
   >when we port it to upstream we can remove the requirement of archival 
needing to block for ECTR
   
   I'd love to sketch a full end-to-end solution for this. IMO your 
requirements of running cleaning sporadically (i.e. the clean plan may not run 
for months) and always wanting incremental cleans are in conflict with each 
other. We can make this work as long as clean is scheduled at least every few 
hours for each table being managed, i.e. the system needs liveness. 
   
   
   Even for approach (b), whether we can safely advance ECTR, in a guaranteed 
fashion, before hitting archival depends on write activity, e.g. delta commits, 
commits, replace commits. For example, if you have an existing ECTR at a 
completed clean, only a .deltacommit/.commit that only adds files (vs. 
producing new file slices) can safely advance ECTR. So, if we get a replace 
commit in between, we cannot advance ECTR beyond that; i.e. for `clean, c1, c2, 
...., c10, replace commit RC | update commit, c11, ... , c50`, we cannot 
advance ECTR beyond the replace/update commit, right? And the instants to its 
left could become candidates for archival. 
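   
   To make the example above concrete, here is a rough sketch (not Hudi code) 
of finding the instant that ECTR cannot move beyond. `Instant`, 
`adds_files_only` and `first_blocking_index` are hypothetical names made up for 
illustration; a real check would have to inspect commit metadata. 

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    time: str                      # label such as "c1" or "RC"
    action: str                    # "clean", "commit", "deltacommit", "replacecommit"
    adds_files_only: bool = True   # hypothetical flag, not an actual Hudi field


def first_blocking_index(timeline: List[Instant], ectr_index: int) -> Optional[int]:
    """Index of the first instant after ectr_index that ECTR cannot move beyond,
    or None if every later instant only adds files."""
    for i in range(ectr_index + 1, len(timeline)):
        inst = timeline[i]
        # Only add-only commits/delta commits let ECTR keep advancing;
        # anything else is treated as blocking in this sketch.
        add_only = inst.action in ("commit", "deltacommit") and inst.adds_files_only
        if not add_only:
            return i
    return None


# clean, c1..c10, RC, c11..c50 (the sequence from the example above)
timeline = (
    [Instant("clean", "clean")]
    + [Instant(f"c{n}", "commit") for n in range(1, 11)]
    + [Instant("RC", "replacecommit", adds_files_only=False)]
    + [Instant(f"c{n}", "commit") for n in range(11, 51)]
)
blocker = first_blocking_index(timeline, ectr_index=0)
print(timeline[blocker].time)  # RC
```

   Everything to the left of the returned instant is what could become an 
archival candidate while ECTR is stuck there. 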
   
   In short, we need to either embrace the LSM timeline (and become performant 
at scanning all instants from a given point) [replay log], or accept that after 
a point it makes no sense to do log replay, i.e. we just read MDT and plan the 
full clean. I feel we are overdesigning around performance without many real 
benchmarks. 
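   
   As a sketch of that either/or, assuming a cutoff that would have to come 
out of the microbenchmarks (`replay_limit` and `choose_clean_planning` are 
hypothetical, not existing Hudi configs or APIs): 

```python
def choose_clean_planning(instants_since_ectr: int, replay_limit: int) -> str:
    """Return "incremental" to replay instants since ECTR,
    or "full" to read MDT and plan a full clean."""
    if instants_since_ectr < 0 or replay_limit < 0:
        raise ValueError("counts must be non-negative")
    # Past the (benchmark-derived, hypothetical) limit, log replay stops paying off.
    return "incremental" if instants_since_ectr <= replay_limit else "full"
```

   Where that cutoff sits is exactly what the benchmarks would need to tell 
us. 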


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
