vinothchandar commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3850182551
@kbuci thanks for the detailed summary. I skimmed the PR. tbh, understanding the scale issues (so we can at least try reproducing the bottleneck with full cleaning) seems like the first thing to try. More thoughts below:

> we would still need some "operation" that wouldn't actually clean anything but would write a new ECTR to the MDT

Right now that operation is the clean planning, which then writes the empty commit. But you are right that something needs to 'advance' the ECTR periodically, and I don't want to add any new table services around this. One idea I am thinking of is as follows, and it can be helpful for the other checkpoint maintenance track as well. https://github.com/apache/hudi/issues/17848

> if we can maintain the table service state also in the MDT..

Instead of the MDT, we could just make the ECTR and checkpoints rolling-over metadata in the timeline. If you think about this conceptually, we can model this either as `state` (MDT) or as `log` (timeline). Any action completed to the timeline is already performed under a lock (if you have multi writers or async table services), so we can safely aggregate such metadata with each write. i.e., the ECTR would be stored in every instant on the active timeline, even for non-clean actions (ofc a new clean can change that value). This is way more flexible: we can point to the actual last clean instant in the LSM timeline if needed, and read parts of the log efficiently for any planning activity.

**Question**: Can you confirm that, if the ECTR is always retained in the last instant of the timeline, we can resolve your scenario?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
