Re: [I] [to be discussed] Configure clean on spark to gracefully handle a large increase in uncleaned files. [hudi]

via GitHub Wed, 04 Feb 2026 15:07:32 -0800


vinothchandar commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3850182551


   @kbuci thanks for the detailed summary.  I skimmed the PR. tbh, 
understanding the scale issues (so we can at least try to reproduce the 
bottleneck with full cleaning) seems like the first thing to try. 
   
   More thoughts below: 
   
   > we would still need some "operation" that wouldn't actually clean anything 
but would write a new ECTR to the MDT
   
   Right now that operation is the clean planning, which then writes the empty 
commit. But you are right that something needs to 'advance' the ECTR 
periodically, and I don't want to add any new table services around this. One 
idea I am thinking about is as follows; it can be helpful for the other 
checkpoint maintenance track as well: https://github.com/apache/hudi/issues/17848 
   
   
   > if we can maintain the table service state also in the MDT..
   
   Instead of the MDT, we could just make the ECTR and checkpoints rolling 
metadata that is carried over in the timeline. If you think about this 
conceptually, we can model this either as `state` (MDT) or as `log` (timeline). 
   
   Any action completed on the timeline is already performed under a lock (if 
you have multiple writers or async table services). So we can safely aggregate 
such metadata with each write, i.e. the ECTR would be stored in every instant on 
the active timeline, even non-clean actions (ofc a new clean can change that 
value). This is way more flexible: we can point to the actual last clean 
instant in the LSM timeline if needed, and read parts of the log efficiently 
for any planning activity. 
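   
   To make the carry-forward idea above concrete, here is a minimal Python 
sketch. It is not Hudi code: `Instant`, `Timeline`, `complete` and 
`latest_ectr` are hypothetical names made up for illustration. It assumes every 
completed action copies the ECTR from the previous instant and only a clean 
can override it. 

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    time: str            # instant time, e.g. "003"
    action: str          # "commit", "clean", ...
    ectr: Optional[str]  # earliest commit to retain, carried on every instant


class Timeline:
    """Toy append-only timeline (hypothetical; not Hudi's timeline classes)."""

    def __init__(self) -> None:
        self.instants: List[Instant] = []

    def complete(self, time: str, action: str,
                 new_ectr: Optional[str] = None) -> Instant:
        # Assumption: in the real system this step runs under the table lock,
        # so reading the previous instant and appending is safe.
        prev = self.instants[-1].ectr if self.instants else None
        # Only a clean may change the ECTR; everything else rolls it over.
        ectr = new_ectr if action == "clean" and new_ectr is not None else prev
        instant = Instant(time, action, ectr)
        self.instants.append(instant)
        return instant

    def latest_ectr(self) -> Optional[str]:
        # The last instant always holds the current ECTR, so no scan is needed.
        return self.instants[-1].ectr if self.instants else None


timeline = Timeline()
timeline.complete("001", "commit")
timeline.complete("002", "clean", new_ectr="001")
timeline.complete("003", "commit")   # non-clean action still carries the ECTR
print(timeline.latest_ectr())        # prints 001
```

   In this sketch a reader only needs the last instant to find the ECTR, 
regardless of how many non-clean actions landed after the last clean. 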
   
   **Question**: Can you confirm that, if the ECTR is always retained in the 
last instant of the timeline, we can resolve your scenario? 
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
