Re: [I] [to be discussed] Optimize clean on spark to gracefully handle a large increase in uncleaned files. [hudi]

via GitHub Fri, 23 Jan 2026 14:33:58 -0800


nsivabalan commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3792867838


   1a. How to make clean planning efficient for an immutable dataset that has 
only bulk_inserts, but might occasionally have clustering or insert overwrites. 
Our best solution for now is empty cleans. Let's propose this in the next dev 
sync meeting. Even with 1.x, we need to support this use case for v6 tables.
   1b. For the same use case, for v9 tables in 1.x, we can afford to look up the 
LSM timeline all the way back to the previous clean instant if need be. So even 
if we take this route for v9 tables, we still need a solution for v6 tables (1a).
   
   2. Avoid OOMs with very large cleans. If clean has not run for a long 
period of time, then when we resume we do not want to clean, say, 100 commits 
in one go. We need a way to make slower progress and eventually catch up 
without causing OOMs. 
   
   3. If there are too many partitions to touch (DIM workload), planning for 
all partitions could result in OOMs even for a single commit. Yet to be 
re-triaged. 
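
To make the idea in item 2 concrete, here is a minimal sketch of capping how many commits a single clean run covers, so that a backlog is worked off over several runs. It is not Hudi code: `plan_clean_runs` and `max_commits_per_clean` are hypothetical names, and the sketch assumes commit instants are timestamp strings that sort lexicographically from oldest to newest.

```python
from typing import List, Sequence


def plan_clean_runs(uncleaned_commits: Sequence[str],
                    max_commits_per_clean: int) -> List[List[str]]:
    """Split a backlog of uncleaned commits into bounded, oldest-first batches.

    Hypothetical helper: each inner list is the set of commits one clean
    run would cover, so no single run handles more than the cap.
    """
    if max_commits_per_clean < 1:
        raise ValueError("max_commits_per_clean must be at least 1")

    # Assumption: instants are timestamp strings, so sorting gives oldest first.
    remaining = sorted(uncleaned_commits)

    runs: List[List[str]] = []
    for start in range(0, len(remaining), max_commits_per_clean):
        runs.append(remaining[start:start + max_commits_per_clean])
    return runs


if __name__ == "__main__":
    # Ten made-up instants standing in for a backlog of uncleaned commits.
    backlog = [f"202601{day:02d}000000" for day in range(1, 11)]
    for i, batch in enumerate(plan_clean_runs(backlog, 4), start=1):
        print(f"clean run {i}: {len(batch)} commits, "
              f"{batch[0]} .. {batch[-1]}")
```

With a cap like this, memory use per run is bounded by the cap rather than by the size of the backlog, at the cost of needing more runs to catch up.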
   


