kbuci commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3894416252
@vinothchandar Thanks for the follow-up. The requirement we are currently discussing is, I believe, still on the same track as what we mentioned earlier (https://github.com/apache/hudi/issues/17844#issuecomment-3815413409), where, as you put it:

> Issue 1: Avoiding cleaning fallback to full scan mode, when table is updated/clustered sporadically.

I agree that whatever solutions we decide on should ideally be agnostic to the write workload of the dataset, but taking a step back to the original problem, the specific issue/workload that caused us to raise this requirement is:

> A dataset has an ingestion Spark job which only writes bulk-insert instants to the latest partition. But after a few years, a second concurrent job does a clustering write. This causes the next clean planner attempt (in the ingestion job) to scan thousands of partitions in the dataset, since the prior ECTR doesn't exist or there have been many instants since it. We have seen increased runtimes and Spark driver OOM failures, even if there are only a few files to actually "clean" in the clean plan.

Or to be even more specific (assuming that clean is being reliably attempted on the dataset):

- We have (day-based partitioned) datasets with 1000s of partitions; with the current setup, "full table scan" clean planning takes 10+ minutes.
- Each ingestion write only targets partitions within the last few days, creating only new file groups in each partition.
- If there are no update/replace operations for a few days, the ECTR falls out of the active timeline.
- The next time an update/replace operation happens, Hudi attempts a clean right after, but it now has to fall back to a full-scan clean. This has the following impact:
  - If the clean is "inline" during the post-commit ingestion phase, the next write gets delayed.
  - If the clean is run by a concurrent "table service platform", the table lock is held for longer, indirectly delaying ingestion.

Our internal approach to resolve this was to make sure we had a way to constantly "bump up" the ECTR even if there was nothing to clean, hence the "empty clean" solution.

Later on, though, we came across a different problem that applies to more write workloads: clean being paused/not running for a long time. For our org this typically happened because:

- We disable auto clean in ingestion write jobs due to perf issues (the OOM issues mentioned earlier, or runtime increases like https://github.com/apache/hudi/pull/18016). Because we didn't have sufficient Spark resources, we couldn't onboard this to our table service platform for cleaning. So until the fix was deployed a few weeks later, clean was paused.
- The next time clean is unpaused, the ingestion job resorts to a full table scan. Unless the dataset workload updates a high enough percentage of older partitions, this is typically worse performance than an incremental clean.

Because of this, we later added a guardrail to our internal Hudi build to block archival on the ECTR.

As you already brought up, though, this is an orchestration issue "outside" of Hudi. I had a discussion with @nsivabalan, and it seems that rather than having our internal Hudi build keep blocking archival on the ECTR, we can re-frame the problem as "only attempt archival if clean has finished". For inline table services, this means having a way to automatically disable archival if clean is disabled.
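To make that last point concrete, here is a minimal Spark/Scala sketch of an ingestion writer that keeps archival in lockstep with auto clean, using the standard `hoodie.clean.automatic` and `hoodie.archive.automatic` write configs. This is only an illustration of the pairing, not the mechanism being proposed; the table name, base path, and input DataFrame are placeholders.

```scala
// Sketch only: pairing the clean and archival toggles so archival is not attempted
// while auto clean is disabled. Assumes an active SparkSession `spark` and an input
// DataFrame `df` of new records; table name and path are placeholders.
import org.apache.spark.sql.SaveMode

val cleanEnabled = false // e.g. clean paused due to the perf issues described above

df.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.clean.automatic", cleanEnabled.toString)
  // keep archival in lockstep with clean, so instants needed for a future
  // incremental clean stay on the active timeline while clean is paused
  .option("hoodie.archive.automatic", cleanEnabled.toString)
  .mode(SaveMode.Append)
  .save("/tmp/example_table")
```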
I will have some more internal discussions on this, but we can track/discuss this separately and just focus on the "first" issue (bumping up the ECTR if there is nothing to clean) in this thread, if that's better. We just wanted to bring up the reason why we currently block archival on the ECTR, since I believe you had asked about it earlier.

Unfortunately, for your other question

> Always to incremental cleaning in these scenarios

I will need some more time to provide a concise answer. Although hypothetically a large enough timeline would cause severe read/write runtime degradation, we have not encountered any scenario in production where the ingestion/query runtime was so long that we had to disallow ingesting new records into the dataset (until an archival was forced).
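As background for that timeline-growth point, the number of commits kept on the active timeline before archival kicks in is governed by the standard `hoodie.keep.min.commits` / `hoodie.keep.max.commits` configs. A minimal sketch follows; the values are placeholders rather than our production settings, and the table name/path are again made up for illustration.

```scala
// Sketch only: the archival bounds that cap active timeline size once archival runs.
// Assumes an active SparkSession `spark` and an input DataFrame `df` of new records.
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.keep.min.commits", "20") // archive down to roughly this many active commits
  .option("hoodie.keep.max.commits", "30") // trigger archival once the timeline exceeds this
  .mode(SaveMode.Append)
  .save("/tmp/example_table")
```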
