kbuci commented on issue #17844: URL: https://github.com/apache/hudi/issues/17844#issuecomment-3874199431
Thanks for the suggestion. I had a chat with @nsivabalan on this as well, and I think this approach (putting the extraMetadata/checkpoint info from the last ingestion write in each commit/replacecommit instant, regardless of operation type) solves the concern of "losing" ingestion checkpoint info. So we can replace the "empty" commit approach (https://github.com/apache/hudi/pull/11606/changes#diff-9d596db608948d6ebe428c69c82e287baef15e50f9382a701d7eb98260e51aff) with this suggested approach.

But for clean/ECTR, this doesn't fully address the issue (unlike empty clean), since we need to make sure the instant corresponding to the ECTR keeps getting "progressed" even if there is nothing to clean. That is needed to minimize the chance of a full table scan clean (if archival were to run before the ECTR progressed) and to avoid incremental clean repeatedly re-reading the same instants since the last ECTR (which takes up time).

My first thought is that, in order to address this (clean) issue, we would need to do one of the following:

a. Implement empty clean. But unlike our org's empty clean approach, when we port it to upstream we can remove the requirement that archival block for the ECTR. We can discuss that guardrail separately, since even empty clean by itself will mostly solve the aforementioned incremental cleaning gaps (if the user is reliably attempting clean on a regular cadence and monitoring for transient failures).

b. Build on your suggestion here (of storing the ECTR in every (delta)commit/replacecommit write): when committing the instant, the writer itself can update the ECTR if it "knows", based on instant metadata, that there is nothing left to clean since the last ECTR. Specifically, it will check the instant file metadata of the ECTR and of every instant after it, and if there are no compacted/replaced/updated base files, it will bump the ECTR up as far as it can.
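
To make option b concrete, here is a rough, language-agnostic sketch in Python of the check the writer would run at commit time. It is not Hudi code: `Instant`, `leaves_files_to_clean`, and `advance_ectr` are hypothetical names, and the rule of stopping just before the first instant that updated, compacted, or replaced base files is a conservative assumption about what "as far as it can" means. ECTR is read here as the earliest commit to retain, and instant times are assumed to be fixed-width strings that sort lexicographically.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a completed timeline instant (not a Hudi class)."""
    timestamp: str               # fixed-width instant time, sortable as a string
    action: str                  # e.g. "commit", "deltacommit", "replacecommit"
    leaves_files_to_clean: bool  # True if it updated/compacted/replaced base files


def advance_ectr(timeline: List[Instant], current_ectr: str) -> str:
    """Return the furthest instant time the ECTR can move to without
    skipping past an instant that left older file versions to clean."""
    new_ectr = current_ectr
    for instant in sorted(timeline, key=lambda i: i.timestamp):
        if instant.timestamp < current_ectr:
            continue  # already behind the ECTR, nothing to inspect
        if instant.leaves_files_to_clean:
            break     # a real clean is needed before moving past this instant
        new_ectr = instant.timestamp
    return new_ectr


timeline = [
    Instant("001", "commit", False),         # insert-only write
    Instant("002", "commit", False),         # insert-only write
    Instant("003", "replacecommit", True),   # replaced base files
    Instant("004", "commit", False),
]
print(advance_ectr(timeline, "001"))
```

In this example the ECTR moves from `001` to `002` and stops there, because `003` replaced base files and still needs an actual clean.
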
I assume the community would not want to go with this approach, since it duplicates some clean logic in ingestion write path and couples clean with ingestion write logic. But just wanted to share it as an alternative.
