[I] [to be discussed] Optimize clean on spark to gracefully handle a large increase in uncleaned files. [hudi]

via GitHub Tue, 13 Jan 2026 11:34:35 -0800


kbuci opened a new issue, #17844:
URL: https://github.com/apache/hudi/issues/17844


   ### Task Description
   
   **What needs to be done:**
   For the instant/duration-based cleaning policies 
(KEEP_LATEST_COMMITS/KEEP_LATEST_BY_HOURS), we would like the implementation 
of the clean planner/execution in Spark to be optimized to meet the following 
requirement:
   
   - If there are new files to be cleaned, then the next clean should be 
scheduled and executed with a bounded amount of time and memory, regardless of 
how many seconds/commits have elapsed since the prior ECTR (clean's earliest 
commit to retain). It is acceptable if the new `.clean` only makes "partial" 
progress and blocks the new ECTR, as long as the next clean plan "resumes" from 
where the previous clean left off.
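
To make the requirement concrete, here is a minimal sketch of a budgeted planner that resumes from a partial ECTR. It is a toy model, not Hudi code: `CleanPlan`, `plan_bounded_clean`, `cleanable_by_instant`, and `max_files` are hypothetical names, and instants are modeled as sortable strings mapped to the files that become cleanable once the ECTR moves past them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CleanPlan:
    files_to_delete: Tuple[str, ...]
    # Earliest commit to retain once this plan has executed. It is older than
    # the requested target when the budget cut the plan short.
    ectr: str


def plan_bounded_clean(
    cleanable_by_instant: Dict[str, List[str]],  # hypothetical timeline model
    prior_ectr: Optional[str],
    target_ectr: str,
    max_files: int,  # hypothetical per-plan budget
) -> CleanPlan:
    """Walk instants in [prior_ectr, target_ectr) oldest first, stopping at the budget."""
    selected: List[str] = []
    for instant in sorted(cleanable_by_instant):
        if prior_ectr is not None and instant < prior_ectr:
            continue  # already covered by an earlier clean
        if instant >= target_ectr:
            break
        files = cleanable_by_instant[instant]
        # Always take at least one instant so that every plan makes progress.
        if selected and len(selected) + len(files) > max_files:
            # Budget exhausted: retain from this instant onward and stop.
            return CleanPlan(tuple(selected), instant)
        selected.extend(files)
    return CleanPlan(tuple(selected), target_ectr)


timeline = {"001": ["a", "b"], "002": ["c"], "003": ["d", "e"], "004": ["f"]}
first = plan_bounded_clean(timeline, None, "004", max_files=3)
# The second plan starts from the ECTR the first plan actually reached.
second = plan_bounded_clean(timeline, first.ectr, "004", max_files=3)
```

In this model the first plan covers instants `001` and `002` and records `003` as its ECTR; the second plan picks up at `003` and reaches the target. Because a single instant is never split, one instant with more than `max_files` files would still exceed the budget, so a real implementation would need a finer-grained resume point.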
   
   
   **Why this task is needed:**
   
   For context, we would like to handle the following scenarios that we have 
seen in our use cases:
   - A dataset has an ingestion spark job which only writes bulk-insert 
instants. But after a few years, a second concurrent job does a clustering 
write. This causes the next clean planner attempt (in the ingestion job) to 
scan thousands of partitions in the dataset, since the prior ECTR doesn't exist 
or there have been many instants since it. We have seen increased runtimes and 
spark driver OOM failures - even if there are only a few files to actually 
"clean" in the clean plan.
   - An update-heavy dataset has had no cleans attempted (due to a 
misconfiguration or orchestration issue). When clean runs again, it can OOM 
when creating the .clean/.clean.requested due to the high number of files to 
clean.
   
   Note that both the incremental and non-incremental clean planners determine 
a list of partitions, and find all updated/replaced file groups before the 
new/proposed ECTR. The difference is that the non-incremental clean planner 
scans all partitions in the dataset, instead of only the partitions referenced 
by instants since the latest ECTR.
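
The difference in partition selection can be sketched as follows. This is a simplified illustration and not the actual planner: `partitions_to_scan` and `partitions_by_instant` are hypothetical names, and the fallback to a full scan when no prior ECTR exists is an assumption based on the first scenario above.

```python
from typing import Dict, Iterable, Optional, Set


def partitions_to_scan(
    all_partitions: Iterable[str],
    partitions_by_instant: Dict[str, Set[str]],  # hypothetical: instant -> partitions it touched
    last_ectr: Optional[str],
    incremental: bool,
) -> Set[str]:
    """Return the partitions a clean planner would list in this toy model."""
    # Non-incremental mode, or no prior ECTR to start from (assumed fallback):
    # every partition in the dataset is scanned.
    if not incremental or last_ectr is None:
        return set(all_partitions)
    # Incremental mode: only partitions referenced by instants since the last ECTR.
    touched: Set[str] = set()
    for instant, partitions in partitions_by_instant.items():
        if instant >= last_ectr:
            touched |= partitions
    return touched


dataset = ["p1", "p2", "p3", "p4"]
history = {"001": {"p1"}, "002": {"p2"}, "003": {"p2", "p3"}}
incremental_scan = partitions_to_scan(dataset, history, "002", incremental=True)
full_scan = partitions_to_scan(dataset, history, "002", incremental=False)
```

Here the incremental call lists only `p2` and `p3`, while the non-incremental call lists all four partitions. In either mode the cost grows with the number of instants or partitions behind the new ECTR, which is the growth this task asks to bound.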
   
   ### Task Type
   
   Code improvement/refactoring
   
   ### Related Issues
   
   **Parent feature issue:** (if applicable )
   **Related issues:**
   NOTE: Use `Relationships` button to add parent/blocking issues after issue 
is created.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
