kbuci commented on issue #17844:
URL: https://github.com/apache/hudi/issues/17844#issuecomment-3874199431

   Thanks for the suggestion. I also had a chat with @nsivabalan about this, and 
I think this approach (putting the extraMetadata/checkpoint info from the last 
ingestion write in each commit/replacecommit instant, regardless of operation 
type) solves the concern of "losing" ingestion checkpoint info. So we can 
replace the "empty" commit approach 
(https://github.com/apache/hudi/pull/11606/changes#diff-9d596db608948d6ebe428c69c82e287baef15e50f9382a701d7eb98260e51aff) 
with this suggested approach.
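
   To make the carry-forward idea concrete, here is a minimal sketch in plain 
Python (not Hudi code). The key name `checkpoint` and the helper 
`with_carried_checkpoint` are hypothetical and only model the behavior: a 
non-ingestion write copies the checkpoint from the previous instant's 
extraMetadata, while an ingestion write keeps its own.

```python
from typing import Dict, Optional

# Hypothetical key name, used only for illustration.
CHECKPOINT_KEY = "checkpoint"


def with_carried_checkpoint(
    new_extra: Dict[str, str],
    last_extra: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Return the extraMetadata to commit, carrying the checkpoint forward.

    If the new instant has no checkpoint of its own (e.g. a clustering or
    compaction write), copy the one from the previous instant so it is
    never "lost" from the latest instant on the timeline.
    """
    merged = dict(new_extra)  # do not mutate the caller's dict
    if CHECKPOINT_KEY not in merged and last_extra:
        if CHECKPOINT_KEY in last_extra:
            merged[CHECKPOINT_KEY] = last_extra[CHECKPOINT_KEY]
    return merged
```

   With this shape, a reader only ever needs to look at the latest 
commit/replacecommit instant to find the ingestion checkpoint.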
   
   But for clean/ECTR, this doesn't fully address the issue (unlike empty 
clean), since we need to make sure the instant corresponding to the ECTR 
keeps "progressing" even if there is nothing to clean. That is needed to 
minimize the chance of a full-table-scan clean (if archival were to run before 
the ECTR progressed) and to avoid incremental clean repeatedly re-reading the 
same instants since the last ECTR (which takes up time). My first thought is 
that, to address this (clean) issue, we would need to either:
   
   a. Implement empty clean. But unlike our org's empty clean approach, when we 
port it upstream we can remove the requirement that archival block on the 
ECTR. We can discuss that guardrail separately, since even empty clean 
by itself will mostly solve the aforementioned incremental cleaning gaps (if 
the user is reliably attempting clean on a regular cadence and monitoring for 
transient failures). 
   
   OR
   
   b. We can build on your suggestion here (of storing the ECTR in every 
(delta)commit/replacecommit write): when committing the instant, the writer 
itself can update the ECTR if it "knows", based on instant metadata, that there 
is nothing left to clean since the last ECTR. Specifically, it will check 
the instant file metadata of the ECTR and of every instant after it, and if 
there are no compacted/replaced/updated (base files), it will bump the ECTR up 
as far as it can. I assume the community would not want to go with this 
approach, since it duplicates some clean logic in the ingestion write path and 
couples clean with ingestion write logic. But I just wanted to share it as an 
alternative.
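
   A rough sketch of the check in (b), in plain Python rather than Hudi code. 
The `Instant` type, its `creates_cleanable_files` flag, and `advance_ectr` are 
hypothetical names for illustration. It assumes instant timestamps are 
fixed-width strings that sort lexicographically, and that the flag has already 
been derived from each instant's metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    """Hypothetical, simplified stand-in for a completed timeline instant."""
    timestamp: str
    # True if the instant's metadata shows compacted/replaced/updated base
    # files, i.e. it left older file versions behind for clean to remove.
    creates_cleanable_files: bool


def advance_ectr(timeline: List[Instant], current_ectr: str) -> str:
    """Bump the ECTR as far forward as possible without passing an instant
    that left something to clean.

    Walks the instants at or after the current ECTR in time order and stops
    at the first one that produced cleanable files; if none did, the ECTR
    lands on the latest instant.
    """
    new_ectr = current_ectr
    for instant in sorted(timeline, key=lambda i: i.timestamp):
        if instant.timestamp < current_ectr:
            continue  # already behind the ECTR, not our concern here
        new_ectr = instant.timestamp
        if instant.creates_cleanable_files:
            break  # a real clean is needed before moving past this one
    return new_ectr


if __name__ == "__main__":
    timeline = [
        Instant("001", False),  # insert-only commit
        Instant("002", False),  # insert-only commit
        Instant("003", True),   # update rewrote base files
        Instant("004", False),
    ]
    print(advance_ectr(timeline, "001"))
```

   Even in this toy form, the writer has to reason about what clean would do, 
which is the coupling concern mentioned above.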


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.