kbuci commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3775629090

   > Thanks [@kbuci](https://github.com/kbuci) for the feature ask.
   > 
   > * May I know if we are looking for a fix in latest master (1.x), or 0.x as well.
   
   Just 1.x, the latest master.
   
   > * Clarification on the 2nd fix within your organization: I understand the config to control the "num instants to clean" in every clean operation. But did you also consider a "num files to clean" config? With num files, we can set it to 100k or something which would work for all datasets, right? Just that with "num instants to clean", as you pointed out, we might need to manually configure different values for different streams.
   
   We considered a "num files to clean" config in internal design discussions, for the reasons you raised. Essentially I was thinking of having two types of "knobs" (see the sketch after this list):
   - The "num files to clean" cap: an upper bound on how many files a single clean operation actually deletes.
   - The storage footprint of the clean planner. For example, even if there is nothing to clean, if there are thousands of partitions with thousands of files each, then the in-memory filesystem view object created during clean planning may cause memory pressure. For such datasets/workloads, we would probably want some sort of limit on `partition count scanned x files per partition`.
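   
   A minimal, hypothetical sketch of how those two knobs could bound a clean plan; the names here (`PartitionScan`, `maxFilesToClean`, `maxFileScans`) are illustrative assumptions, not actual Hudi configs or APIs:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   public class BoundedCleanPlanSketch {
   
     // Per-partition result of the planner's scan: how many files the filesystem
     // view had to load, and which of them are eligible for deletion.
     static class PartitionScan {
       final String partitionPath;
       final int filesScanned;            // footprint of building the view for this partition
       final List<String> filesToClean;   // files actually eligible for deletion
   
       PartitionScan(String partitionPath, int filesScanned, List<String> filesToClean) {
         this.partitionPath = partitionPath;
         this.filesScanned = filesScanned;
         this.filesToClean = filesToClean;
       }
     }
   
     /**
      * Keep adding partitions to the plan until either knob is exhausted:
      * knob 1 bounds deletions, knob 2 bounds the planner's scan footprint
      * (roughly "partition count scanned x files per partition").
      */
     static List<PartitionScan> boundPlan(List<PartitionScan> candidates,
                                          int maxFilesToClean,
                                          long maxFileScans) {
       List<PartitionScan> plan = new ArrayList<>();
       int filesInPlan = 0;
       long scanned = 0;
       for (PartitionScan ps : candidates) {
         if (filesInPlan + ps.filesToClean.size() > maxFilesToClean
             || scanned + ps.filesScanned > maxFileScans) {
           break; // remaining partitions are deferred to a later clean
         }
         filesInPlan += ps.filesToClean.size();
         scanned += ps.filesScanned;
         plan.add(ps);
       }
       return plan;
     }
   }
   ```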
   
   For incremental clean, this may mean that in the worst case we would only move the ECTR (earliest commit to retain) forward by one instant per clean. For full-scan clean it's a bit trickier: we would have to "partially" clean a subset of the partitions per run of clean, as sketched below.
   
   Also, even if we were to add these configs, the next issue would be how users know what values to set. It would be ideal if a user could pass in a single general config for clean dictating "how much memory should clean use at most", and then Hudi could infer the values for the aforementioned configs, along the lines of the sketch below.
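   
   A back-of-envelope sketch of that inference: the ~1 KB per-file-entry estimate and all names are assumptions for illustration only, not anything Hudi ships today:
   
   ```java
   public class CleanMemoryBudgetSketch {
   
     // Rough assumed in-memory cost of one file entry in the filesystem view.
     static final long ASSUMED_BYTES_PER_FILE_ENTRY = 1_024L;
   
     /** Memory budget (bytes) -> cap on "partition count scanned x files per partition". */
     static long inferMaxFileScans(long maxCleanMemoryBytes) {
       return maxCleanMemoryBytes / ASSUMED_BYTES_PER_FILE_ENTRY;
     }
   
     public static void main(String[] args) {
       // e.g. a 1 GiB budget would allow roughly one million file entries to be scanned
       System.out.println(inferMaxFileScans(1L << 30)); // prints 1048576
     }
   }
   ```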
   
   > Was just curious on any brainstorming you did before going ahead with the "num instants to clean" based solution.
   
   Back when we implemented that config internally, it was done urgently so that we could quickly mitigate production issues with upsert-heavy datasets, where, if clean was paused for a while and resumed later, the next clean would have to process a large enough count of files to cause an OOM. So regardless of whether this config makes it into the final design here, we don't expect this internal config to be sufficient for all the workloads/scenarios discussed above. (The gist of that internal knob is sketched below.)
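   
   A minimal sketch of that internal mitigation: cap how many pending completed instants a single incremental clean processes, so a clean resuming after a long pause works through a bounded backlog each run. `maxInstantsPerClean` is our internal-style knob, not a shipped Hudi config:
   
   ```java
   import java.util.List;
   
   public class InstantCappedCleanSketch {
   
     /** Take only the oldest maxInstantsPerClean instants; the rest wait for the next clean. */
     static List<String> instantsForThisClean(List<String> pendingInstants, int maxInstantsPerClean) {
       return pendingInstants.subList(0, Math.min(maxInstantsPerClean, pendingInstants.size()));
     }
   }
   ```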
   