voonhous commented on PR #13285:
URL: https://github.com/apache/hudi/pull/13285#issuecomment-2874786836

   IIUC, there are 6 scenarios:
   
   1. **Happy path 1**: No failures
   2. **Happy path 2**: Empty Checkpoint (No Data from Any Operator)
       - Flink triggers a checkpoint `chkId`, but no data has arrived at any operator since the last one. All operators send a `WriteMetadataEvent` with an empty `WriteStatus` list. `allowCommitOnEmptyBatch` is false.
   3. **Edge case 1**: Operator Restarts (Bootstrap Path) where:
       - A specific operator task fails and is restarted by Flink. The Coordinator and the other operators remain running. The restarted operator needs to sync up with the Coordinator.
   4. **Edge case 2**: Checkpoint Fails Globally After Operator Sent Metadata where:
       - Operator snapshots successfully and sends a `WriteMetadataEvent` for `chkId` and `InstX`. Flink begins committing the checkpoint `chkId`, but it fails globally (e.g., another operator fails or the JobManager fails).
   5. **Edge case 3**: Commit Fails in Coordinator where:
       - Coordinator receives all `WriteMetadataEvent`s for `chkId`, and Flink calls `notifyCheckpointComplete(chkId)`. Coordinator attempts `writeClient.commit(InstX)`, but it fails.
   6. **Edge case 4**: Commit Acknowledgment (ACK) Lost where:
       - Coordinator successfully commits `InstX` and sends a `CommitAckEvent` to Operator N, but either the network fails or the Operator processes the ACK slowly, and Flink triggers the next checkpoint (`chkId+1`) before the Operator unblocks. The Operator proceeds and does an empty commit.
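
To make scenario 2 concrete, here is a minimal toy sketch of how a coordinator could decide what to do when the checkpoint completes. It is not Hudi's actual coordinator code: `ToyCoordinator`, `Outcome`, `onWriteMetadata` and `onCheckpointComplete` are hypothetical names, write statuses are modelled as plain strings, and the behaviour "skip the commit when every event is empty and `allowCommitOnEmptyBatch` is false" is my assumption about the intended outcome.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model for illustration only; this is not Hudi's coordinator code.
final class ToyCoordinator {
    enum Outcome { WAITING_FOR_EVENTS, SKIPPED_EMPTY, COMMITTED }

    private final boolean allowCommitOnEmptyBatch;
    // One slot per operator subtask; null means no event has been received yet.
    private final List<List<String>> eventBuffer = new ArrayList<>();

    ToyCoordinator(int parallelism, boolean allowCommitOnEmptyBatch) {
        this.allowCommitOnEmptyBatch = allowCommitOnEmptyBatch;
        for (int i = 0; i < parallelism; i++) {
            eventBuffer.add(null);
        }
    }

    // Stand-in for receiving a WriteMetadataEvent from one subtask.
    void onWriteMetadata(int taskId, List<String> writeStatuses) {
        eventBuffer.set(taskId, new ArrayList<>(writeStatuses));
    }

    // Stand-in for the coordinator's reaction to notifyCheckpointComplete(chkId).
    Outcome onCheckpointComplete() {
        boolean allEmpty = true;
        for (List<String> statuses : eventBuffer) {
            if (statuses == null) {
                // At least one subtask has not reported yet.
                return Outcome.WAITING_FOR_EVENTS;
            }
            allEmpty &= statuses.isEmpty();
        }
        // A decision has been reached: clear the buffer for the next checkpoint.
        for (int i = 0; i < eventBuffer.size(); i++) {
            eventBuffer.set(i, null);
        }
        if (allEmpty && !allowCommitOnEmptyBatch) {
            // Assumed behaviour for scenario 2: nothing to commit, so skip.
            return Outcome.SKIPPED_EMPTY;
        }
        return Outcome.COMMITTED;
    }
}
```

With two subtasks and `allowCommitOnEmptyBatch` set to false, two empty events yield `SKIPPED_EMPTY`, while one non-empty event is enough to yield `COMMITTED`. The failure scenarios (3 to 6) are deliberately left out of this sketch.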

