nsivabalan commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3792867838
1a. How to make clean planning efficient for an immutable dataset that has only bulk_inserts but may occasionally see clustering or insert overwrites. Our best solution for now is empty cleans. Let's propose this in the next dev sync meeting. Even with 1.x, we need to support this use case for v6 tables.

1b. For the same use case, for v9 tables in 1.x, we can afford to look up the LSM timeline all the way back to the previous clean instant if need be. So even if we take this route for v9 tables, we still need a solution for v6 tables (1a).

2. Avoid OOMs with very large cleans. If clean has not run for a long period, we do not want to clean, say, 100 commits in one go when we resume. We need a way to make slower progress and eventually catch up without causing OOMs.

3. If there are too many partitions to touch (DIM workload), planning for all partitions could result in OOMs even for a single commit. Yet to be re-triaged.
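To make the "slower progress" idea in item 2 concrete, here is a minimal sketch of capping how many commits a single clean covers, so that a long backlog drains over several runs. This is an illustration only, not Hudi's cleaner implementation: `plan_bounded_clean`, `drain_backlog`, and `max_commits_per_clean` are hypothetical names, and commits are modeled as plain strings ordered oldest first.

```python
from typing import List, Tuple


def plan_bounded_clean(
    pending_commits: List[str],
    max_commits_per_clean: int,
) -> Tuple[List[str], List[str]]:
    """Split the backlog into the batch to clean now and the remainder.

    pending_commits: commit instants eligible for cleaning, oldest first.
    max_commits_per_clean: hypothetical cap on commits handled per clean.
    """
    if max_commits_per_clean <= 0:
        raise ValueError("max_commits_per_clean must be positive")
    # Take only the oldest commits up to the cap; the rest wait for a later run.
    batch = pending_commits[:max_commits_per_clean]
    remainder = pending_commits[max_commits_per_clean:]
    return batch, remainder


def drain_backlog(
    pending_commits: List[str],
    max_commits_per_clean: int,
) -> List[List[str]]:
    """Simulate successive clean runs until the backlog is empty."""
    batches: List[List[str]] = []
    remaining = list(pending_commits)
    while remaining:
        batch, remaining = plan_bounded_clean(remaining, max_commits_per_clean)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    # A backlog of 100 commits drained 10 at a time.
    backlog = [f"c{i:03d}" for i in range(100)]
    for run, batch in enumerate(drain_backlog(backlog, 10), start=1):
        print(f"clean run {run}: {batch[0]} .. {batch[-1]}")
```

With a cap like this, the memory needed for planning depends on the cap rather than on how long clean was paused. The same batching shape could in principle be applied to partitions for item 3, though that is a separate question still to be triaged.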
