hudi-bot opened a new issue, #15309:
URL: https://github.com/apache/hudi/issues/15309
The current _*StreamWriteOperatorCoordinator*_ in Flink tries to
re-commit the last batch during restart, and users may rely on this behavior to
achieve exactly-once semantics when reading from sources like Kafka.
However, the re-commit operation may be skipped if the user sets a different
_write.tasks_ parameter on restart, because the current implementation keeps
_write.tasks_ slots to track events from subtasks and does not handle
the case where the write parallelism changes.
For example:
# Start with _write.tasks = 4_, and the application crashes (or stops) right
after the Flink checkpoint (e.g., CkpId=1) but before the Hudi commit.
# Restart with _write.tasks = 8_. The coordinator will receive 4
restored bootstrap metadata events and 4 empty bootstrap events. Since the
arrival order of these events is not deterministic, the coordinator may not
re-commit the last commit.
# The source (e.g., the Kafka reader) uses the checkpoint ID to guide its
consumption, so on restart it will read from the next offset given by
_CkpId=1_. We then lose all the data in Hudi for that batch (i.e., CkpId=1).
A similar problem also occurs when _write.tasks_ is smaller on
restart, e.g., 4 -> 2.
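To make the order-dependence concrete, here is a deliberately simplified toy model. This is not Hudi's actual coordinator code; the class and method names are hypothetical. It only illustrates the reported symptom: with a mix of restored and empty bootstrap events, whether the re-commit fires can depend on which kind of event happens to arrive last.

```java
import java.util.List;

// Hypothetical toy model of the race described above. Each bootstrap event is
// reduced to a boolean: true = restored bootstrap metadata event, false =
// empty bootstrap event (a subtask with no restored state).
public class RecommitRace {
  // In this toy model, the coordinator fills one slot per subtask event and
  // decides on re-commit only when the last-arriving event carried restored
  // metadata. Because subtask event order is nondeterministic, the same set
  // of events can either trigger or skip the re-commit.
  public static boolean recommitTriggered(List<Boolean> arrivalOrder) {
    return arrivalOrder.get(arrivalOrder.size() - 1);
  }
}
```

With 4 restored and 4 empty events, one interleaving triggers the re-commit and another skips it, which matches the nondeterminism described in the issue.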
This Jira will fix the implementation to ensure the re-commit is performed
when _write.tasks_ changes. Though the exactly-once semantics could also be
restored by changing the reader side (e.g., tracking the ckpId in the Hudi
commit data and using it to guide the reader), that would require Hudi users to
change their application code.
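One way the coordinator-side fix could look, sketched under the assumption that the pending instant is discoverable from the timeline (all names below are hypothetical, not Hudi's actual API): make the re-commit decision depend on whether a pending instant exists, not on collecting _write.tasks_ bootstrap events, so it becomes independent of parallelism and event arrival order.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a parallelism-independent re-commit: trigger the
// re-commit from the pending-instant state itself rather than from counting
// per-subtask bootstrap events.
public class ParallelismSafeRecommit {
  private final AtomicBoolean recommitted = new AtomicBoolean(false);
  public int commitCount = 0;  // exposed only for the demo below

  // pendingInstant: the instant left inflight by the previous run, if any.
  // In Hudi this would be read from the timeline rather than passed in.
  public void onBootstrapEvent(Optional<String> pendingInstant) {
    // Re-commit at most once, regardless of how many subtasks (old or new
    // parallelism) report bootstrap events and in what order they arrive.
    if (pendingInstant.isPresent() && recommitted.compareAndSet(false, true)) {
      commit(pendingInstant.get());
    }
  }

  private void commit(String instant) {
    commitCount++;
    System.out.println("re-committing instant " + instant);
  }
}
```

With this shape, 4, 8, or 2 bootstrap events all lead to exactly one re-commit of the pending instant.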
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-4521
- Type: Bug
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]