kbuci commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3894416252
@vinothchandar Thanks for the follow-up. The requirement we are currently discussing is, I believe, still on the same track as what we mentioned earlier (https://github.com/apache/hudi/issues/17844#issuecomment-3815413409), where, as you put it:

> Issue 1: Avoiding cleaning fallback to full scan mode, when table is updated/clustered sporadically.

I agree that whatever solutions we decide on should ideally be agnostic to the write workload of the dataset, but taking a step back to the original problem, the specific issue/workload that caused us to raise this requirement is:

> A dataset has an ingestion Spark job which only writes bulk-insert instants to the latest partition. But after a few years, a second concurrent job does a clustering write. This causes the next clean planner attempt (in the ingestion job) to scan thousands of partitions in the dataset, since the prior ECTR doesn't exist or there have been many instants since it. We have seen increased runtimes and Spark driver OOM failures, even if there are only a few files to actually "clean" in the clean plan.

Or to be even more specific (assuming that clean is being reliably attempted on the dataset):

- We have (day-based partitioned) datasets with 1000s of partitions; with the current setup, "full table scan" clean planning takes 10+ minutes.
- Each ingestion write only targets partitions within the last few days, creating only new file groups in each partition.
- If there are no update/replace operations for a few days, the ECTR falls out of the active timeline.
- The next time an update/replace operation happens, Hudi attempts a clean right after, but it now has to fall back to a full-scan clean. This has the following impact:
  - If the clean is "inline" during the post-commit ingestion phase, the next write gets delayed.
  - If the clean is run by a concurrent "table service platform", the table lock is held for longer, indirectly delaying ingestion.

Our internal approach to resolve this was to make sure we had a way to constantly "bump up" the ECTR even if there was nothing to clean, hence the "empty clean" solution.

Later on, though, we came across a different problem that applies to more write workloads: clean being paused/not running for a long time. For our org this typically happened because:

- We disable auto clean in ingestion write jobs due to perf issues (the OOM issues mentioned earlier, or runtime increases like https://github.com/apache/hudi/pull/18016). Because we didn't have sufficient Spark resources, we couldn't onboard this to our table service platform for cleaning. So until the fix was deployed a few weeks later, clean was paused.
- The next time clean is unpaused, the ingestion job resorts to a full table scan. Unless the dataset workload updates a high enough percentage of older partitions, this is typically worse performance than an incremental clean.

Because of this, we later added a guardrail to our internal Hudi build to block archival on the ECTR.

As you already brought up, though, this is an orchestration issue "outside" of Hudi. I had a discussion with @nsivabalan, and it seems that rather than having our internal Hudi build keep blocking archival on the ECTR, we can re-frame the problem as "only attempt archival if clean has finished". For inline table services, this means having a way to automatically disable archival if clean is disabled.
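To make that last point concrete, here is a minimal Spark/Scala sketch of an ingestion writer that keeps archival in lockstep with auto clean, using the standard `hoodie.clean.automatic` and `hoodie.archive.automatic` write configs. This is only an illustration of the pairing, not the mechanism being proposed; the table name, base path, and input DataFrame are placeholders.

```scala
// Sketch only: pairing the clean and archival toggles so archival is not attempted
// while auto clean is disabled. Assumes an active SparkSession `spark` and an input
// DataFrame `df` of new records; table name and path are placeholders.
import org.apache.spark.sql.SaveMode

val cleanEnabled = false // e.g. clean paused due to the perf issues described above

df.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.clean.automatic", cleanEnabled.toString)
  // keep archival in lockstep with clean, so instants needed for a future
  // incremental clean stay on the active timeline while clean is paused
  .option("hoodie.archive.automatic", cleanEnabled.toString)
  .mode(SaveMode.Append)
  .save("/tmp/example_table")
```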
I will have some more internal discussions on this, but we can track/discuss this separately and just focus on the "first" issue (bumping up the ECTR if there is nothing to clean) in this thread, if that's better. We just wanted to bring up the reason why we currently block archival on the ECTR, since I believe you had asked about it earlier.

Unfortunately, for your other question

> Always to incremental cleaning in these scenarios

I will need some more time to provide a concise answer. Although hypothetically a large enough timeline would cause severe read/write runtime degradation, we have not encountered any scenario in production where the ingestion/query runtime was so long that we had to disallow ingesting new records into the dataset (until an archival was forced).
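As background for that timeline-growth point, the number of commits kept on the active timeline before archival kicks in is governed by the standard `hoodie.keep.min.commits` / `hoodie.keep.max.commits` configs. A minimal sketch follows; the values are placeholders rather than our production settings, and the table name/path are again made up for illustration.

```scala
// Sketch only: the archival bounds that cap active timeline size once archival runs.
// Assumes an active SparkSession `spark` and an input DataFrame `df` of new records.
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.keep.min.commits", "20") // archive down to roughly this many active commits
  .option("hoodie.keep.max.commits", "30") // trigger archival once the timeline exceeds this
  .mode(SaveMode.Append)
  .save("/tmp/example_table")
```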
