kbuci opened a new issue, #17844: URL: https://github.com/apache/hudi/issues/17844
### Task Description

**What needs to be done:**

For instant/duration-based cleaning policies (KEEP_LATEST_COMMITS/KEEP_LATEST_BY_HOURS), we would like the implementation of the clean planner/execution in Spark to be optimized to meet the following requirement:

- If there are new files to be cleaned, then the next clean should be scheduled and executed within a bounded amount of time and memory, regardless of how many seconds/commits have elapsed since the prior ECTR (the clean's earliest commit to retain). It is acceptable if the new `.clean` only makes "partial" progress and blocks the new ECTR, as long as the next clean plan "resumes" from where the previous clean completed.

**Why this task is needed:**

For context, we would like to handle the following scenarios that we have seen in our use cases:

- A dataset has an ingestion Spark job that only writes bulk-insert instants. After a few years, a second concurrent job performs a clustering write. This causes the next clean planner attempt (in the ingestion job) to scan thousands of partitions in the dataset, since the prior ECTR doesn't exist or there have been many instants since it. We have seen increased runtimes and Spark driver OOM failures, even if there are only a few files to actually "clean" in the clean plan.
- An update-heavy dataset has had no cleans attempted (due to a misconfiguration or orchestration issue). When clean runs again, it can OOM when creating the `.clean`/`.clean.requested` due to the high number of files to clean.

Note that both the incremental and non-incremental clean planners determine a list of partitions and find all updated/replaced file groups before the new/proposed ECTR. The difference is that the non-incremental clean scans all partitions in the dataset, instead of only the partitions referenced by instants since the latest ECTR.
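To make the "partial progress, then resume" requirement concrete, here is a minimal sketch of one possible planning loop. It is not Hudi code: `Instant`, `CleanPlan`, `plan_bounded_clean`, and the `max_files` budget are hypothetical names invented for illustration, and the timeline is modeled as a simple list in which each instant carries the files it made obsolete. As a simplification, a single instant larger than the budget is still taken whole so that every plan makes progress; a real planner would likely need to split such an instant.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a completed timeline instant."""
    time: str                          # sortable instant time, e.g. "003"
    obsoleted_files: Tuple[str, ...]   # files this instant made cleanable


@dataclass
class CleanPlan:
    """Hypothetical stand-in for the contents of a clean plan."""
    files_to_delete: List[str] = field(default_factory=list)
    earliest_commit_to_retain: Optional[str] = None  # ECTR recorded by this plan


def plan_bounded_clean(timeline: List[Instant],
                       prior_ectr: Optional[str],
                       target_ectr: str,
                       max_files: int) -> CleanPlan:
    """Plan deletions for instants in [prior_ectr, target_ectr), stopping at the budget."""
    plan = CleanPlan(earliest_commit_to_retain=prior_ectr)
    # Only instants at or after the prior ECTR still need to be examined.
    pending = [i for i in sorted(timeline, key=lambda i: i.time)
               if (prior_ectr is None or i.time >= prior_ectr)
               and i.time < target_ectr]
    for instant in pending:
        over_budget = (len(plan.files_to_delete)
                       + len(instant.obsoleted_files)) > max_files
        if over_budget and plan.files_to_delete:
            # Partial progress: hold the ECTR at the first unprocessed instant
            # so that the next plan resumes from here.
            plan.earliest_commit_to_retain = instant.time
            return plan
        plan.files_to_delete.extend(instant.obsoleted_files)
    # Everything before the target was processed, so the ECTR can fully advance.
    plan.earliest_commit_to_retain = target_ectr
    return plan
```

Calling `plan_bounded_clean` repeatedly, each time passing the previous plan's `earliest_commit_to_retain` as `prior_ectr`, walks the backlog in budget-sized steps until the ECTR reaches the target.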
### Task Type

Code improvement/refactoring

### Related Issues

**Parent feature issue:** (if applicable)

**Related issues:**
