vinothchandar commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3888442695
> In order to minimize chance of full table scan clean (if archival were to run before ECTR progressed) and avoid incremental clean repeatedly re-reading the same instants from last ECTR (which takes up time).

My thought process here is to first microbenchmark this at some rough scale points you can share, then decide. Anything we plan, I'd love to plan alongside the LSM timeline. It's not worth spending effort on other angles IMO; it stretches us thin.

> But for clean/ECTR, this doesn't fully address the issue (unlike empty clean), since we need to make sure the instant corresponding to the ECTR is getting "progressed" even if there is nothing to clean

To your point, yes, there are still complexities.

> when we port it to upstream we can remove the requirement of archival needing to block for ECTR

I'd love to sketch a full end-to-end solution for this. IMO your two requirements, running cleaning sporadically (i.e. the clean plan may not run for months) and always wanting incremental cleans, are in conflict with each other. We can make this work as long as clean is scheduled at least every few hours for each table being managed, i.e. the system needs liveness.

Even for approach (b), whether we can advance ECTR safely, in a guaranteed fashion, before hitting archival depends on write activity, e.g. delta commits, commits, replace commits. For example, if you have an existing ECTR at a completed clean, only a .deltacommit/.commit that only adds files (vs. producing new file slices) can safely advance ECTR. So if we get a replace commit in between, we cannot advance ECTR beyond that; i.e. given `clean, c1, c2, ...., c10, replace commit RC | update commit, c11, ... , c50`, we cannot advance ECTR beyond the replace/update commit, right? And the instants on the left side could become candidates for archival.
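
To make the advancement rule above concrete, here is a rough sketch in Python. It is not Hudi code: the `Instant` class, the `adds_files_only` flag, and `max_safe_ectr` are hypothetical names made up for illustration. It also assumes ECTR may land on the first replace/update commit but not move past it, which is only one reading of the example timeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    ts: str                # instant time, e.g. "c1" (illustrative, not a real Hudi timestamp)
    action: str            # "clean", "commit", "deltacommit" or "replacecommit"
    adds_files_only: bool  # hypothetical flag: True if the write only added new files


def is_barrier(instant: Instant) -> bool:
    """An instant that ECTR cannot be advanced beyond without a real clean."""
    if instant.action == "replacecommit":
        return True
    if instant.action in ("commit", "deltacommit"):
        # Updates produce new file slices, so older slices may need cleaning.
        return not instant.adds_files_only
    return False


def max_safe_ectr(timeline: List[Instant], current_ectr: str) -> str:
    """Walk forward from the current ECTR and stop at the first barrier.

    Assumption: ECTR may sit on the barrier instant itself, not past it.
    """
    start = next(i for i, inst in enumerate(timeline) if inst.ts == current_ectr)
    candidate = timeline[start].ts
    for inst in timeline[start + 1:]:
        candidate = inst.ts
        if is_barrier(inst):
            break
    return candidate


# The example timeline: clean, c1..c10 (add-only), RC, c11..c50.
timeline = (
    [Instant("clean0", "clean", True)]
    + [Instant(f"c{i}", "commit", True) for i in range(1, 11)]
    + [Instant("rc", "replacecommit", False)]
    + [Instant(f"c{i}", "commit", True) for i in range(11, 51)]
)
print(max_safe_ectr(timeline, "clean0"))  # rc
```

Under these assumptions, ECTR stops at `rc` even though `c11` to `c50` add files only, and everything to the left of `rc` is what archival could pick up.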

In short, we either need to embrace the LSM timeline and become performant at scanning all instants from a given point (log replay), or accept that after a point it makes no sense to do log replay, i.e. we just read the MDT and plan the full clean. I feel we are overdesigning around performance without many real benchmarks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
