Re: [I] [to be discussed] Configure clean on spark to gracefully handle a large increase in uncleaned files. [hudi]

via GitHub Mon, 09 Feb 2026 14:27:52 -0800


kbuci commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3874199431

Thanks for the suggestion. I had a chat with @nsivabalan on this as well, and
I think this approach (putting the extraMetadata/checkpoint info from the last
ingestion write in each commit/replacecommit instant, regardless of operation
type) solves the concern of "losing" ingestion checkpoint info. So we can
replace the "empty" commit approach
(https://github.com/apache/hudi/pull/11606/changes#diff-9d596db608948d6ebe428c69c82e287baef15e50f9382a701d7eb98260e51aff)
with this suggested approach.
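
As a rough illustration of the carry-forward idea, here is a minimal sketch. It
assumes each completed commit's extraMetadata can be modeled as a plain dict;
the `checkpoint` key name and the helper `extra_metadata_for_new_commit` are
hypothetical and are not Hudi APIs.

```python
from typing import Dict, List, Optional

CHECKPOINT_KEY = "checkpoint"  # hypothetical key name, for illustration only


def extra_metadata_for_new_commit(
    completed: List[Dict[str, str]],
    own_checkpoint: Optional[str] = None,
) -> Dict[str, str]:
    """Build the extraMetadata for a new commit/replacecommit.

    An ingestion write passes its own checkpoint. Any other operation
    (for example clustering) passes nothing and copies the checkpoint
    from the most recent completed commit, so it is never "lost".
    """
    if own_checkpoint is not None:
        return {CHECKPOINT_KEY: own_checkpoint}
    previous = completed[-1] if completed else {}
    if CHECKPOINT_KEY in previous:
        return {CHECKPOINT_KEY: previous[CHECKPOINT_KEY]}
    return {}
```

Because every instant carries the checkpoint under this scheme, a reader only
needs to look at the latest completed commit to resume ingestion.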

But for clean/ECTR, this doesn't fully address the issue (unlike empty
clean), since we need to make sure the instant corresponding to the ECTR
keeps getting "progressed" even if there is nothing to clean. This is needed
to minimize the chance of a full table scan clean (if archival were to run
before the ECTR progressed) and to avoid incremental clean repeatedly
re-reading the same instants since the last ECTR (which takes up time). My
first thought is that, to address this clean issue, we would need to either:

a. Implement empty clean. Unlike our org's empty clean approach, when we port
it upstream we can remove the requirement that archival block on the ECTR. We
can discuss that guardrail separately, since even empty clean by itself will
mostly solve the aforementioned incremental cleaning gaps (if the user is
reliably attempting clean on a regular cadence and monitoring for transient
failures).

b. Build on your suggestion here (of storing the ECTR in every
(delta)commit/replacecommit write): when committing the instant, the writer
itself can update the ECTR if it "knows", based on instant metadata, that
nothing has been left behind to clean since the last ECTR. Specifically, it
will check the instant file metadata of the ECTR and every instant after it,
and if there are no compacted/replaced/updated base files, it will bump up the
ECTR as far as it can. I assume the community would not want to go with this
approach, since it duplicates some clean logic in the ingestion write path and
couples clean with ingestion write logic. But I just wanted to share it as an
alternative.
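
To make option (b) a bit more concrete, here is a minimal sketch of the check
the writer could run at commit time. It is only an illustration under
assumptions: the `Instant` record, its fields, and `advance_ectr` are
hypothetical names rather than Hudi APIs, and the sketch conservatively stops
just before the first instant that has something to clean.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a completed instant's metadata (not a Hudi class)."""
    time: str                      # instant time, sortable as a string
    action: str                    # e.g. "commit", "deltacommit", "replacecommit"
    replaced_file_groups: int = 0  # file groups replaced by this instant
    updated_base_files: int = 0    # base files rewritten by updates or compaction


def creates_cleanable_files(instant: Instant) -> bool:
    """True if the instant leaves older file versions that clean may need to remove."""
    return instant.replaced_file_groups > 0 or instant.updated_base_files > 0


def advance_ectr(timeline: List[Instant], ectr: str) -> str:
    """Bump the ECTR past instants that produced nothing to clean.

    Walks instants from the current ECTR forward and stops at the first one
    that replaced or updated base files, leaving the ECTR before it.
    """
    new_ectr = ectr
    for instant in sorted(timeline, key=lambda i: i.time):
        if instant.time < ectr:
            continue  # already behind the current ECTR
        if creates_cleanable_files(instant):
            break  # clean has real work here, so stop advancing
        new_ectr = instant.time
    return new_ectr
```

For an insert-only timeline this moves the ECTR up to the latest instant on
every write, which is the "progress even when there is nothing to clean"
behavior that empty clean would otherwise provide.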

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
