voonhous commented on PR #13285:
URL: https://github.com/apache/hudi/pull/13285#issuecomment-2874786836
IIUC, there are 6 scenarios:
1. **Happy path 1**: No failures
2. **Happy path 2**: Empty Checkpoint (No Data from Any Operator)
- Flink triggers a checkpoint `chkId`, but no data has arrived at any
operator since the last one. All operators send `WriteMetadataEvent` with empty
`WriteStatus` lists. `allowCommitOnEmptyBatch` is false.
3. **Edge case 1**: Operator Restarts (Bootstrap Path) where:
- A specific operator task fails and is restarted by Flink. The
Coordinator and other operators remain running. The restarted operator needs to
sync up.
4. **Edge case 2**: Checkpoint Fails Globally After Operator Sent Metadata
where:
- Operator snapshots successfully, sends WriteMetadataEvent for `chkId`
and `InstX`. Flink begins committing the checkpoint `chkId`, but it fails
globally (e.g., another operator fails, JM fails).
5. **Edge case 3**: Commit Fails in Coordinator where:
- Coordinator receives all WriteMetadataEvents for `chkId`, Flink calls
`notifyCheckpointComplete(chkId)`. Coordinator attempts
`writeClient.commit(InstX)` but it fails.
6. **Edge case 4**: Commit Acknowledgment (ACK) Lost where:
- Coordinator successfully commits `InstX` and sends CommitAckEvent to
Operator N, but either the network fails or the Operator processes the ACK
slowly, so Flink triggers the next checkpoint (`chkId+1`) before the Operator
unblocks. The Operator proceeds and does an empty commit.
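
To make scenarios 1, 2 and 5 easier to reason about, here is a toy model of the coordinator-side decision. This is only a sketch under my own assumptions, not the actual Hudi code: `ToyCoordinator`, `Outcome`, `handleWriteMetadataEvent` and the `failCommit` flag are hypothetical names, and write statuses are plain strings. It does not model operator restarts, global checkpoint failure, or ACK handling.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not Hudi's implementation.
class ToyCoordinator {
    enum Outcome { COMMITTED, SKIPPED_EMPTY, WAITING_FOR_EVENTS, COMMIT_FAILED }

    private final boolean allowCommitOnEmptyBatch;
    private final boolean failCommit; // simulates writeClient.commit(InstX) failing
    // One slot per operator task; null means "no event received yet".
    private final List<List<String>> eventBuffer = new ArrayList<>();

    ToyCoordinator(int parallelism, boolean allowCommitOnEmptyBatch, boolean failCommit) {
        this.allowCommitOnEmptyBatch = allowCommitOnEmptyBatch;
        this.failCommit = failCommit;
        for (int i = 0; i < parallelism; i++) {
            eventBuffer.add(null);
        }
    }

    // Stand-in for receiving a WriteMetadataEvent from operator task `taskId`.
    void handleWriteMetadataEvent(int taskId, List<String> writeStatuses) {
        eventBuffer.set(taskId, new ArrayList<>(writeStatuses));
    }

    // Stand-in for notifyCheckpointComplete(chkId); chkId is unused in this toy.
    Outcome notifyCheckpointComplete(long chkId) {
        int totalStatuses = 0;
        for (List<String> statuses : eventBuffer) {
            if (statuses == null) {
                return Outcome.WAITING_FOR_EVENTS; // some task has not reported
            }
            totalStatuses += statuses.size();
        }
        if (totalStatuses == 0 && !allowCommitOnEmptyBatch) {
            reset();
            return Outcome.SKIPPED_EMPTY; // scenario 2: nothing to commit
        }
        if (failCommit) {
            // Scenario 5: the toy leaves the buffer untouched and only reports
            // the failure; real recovery behaviour is not modelled here.
            return Outcome.COMMIT_FAILED;
        }
        reset();
        return Outcome.COMMITTED; // scenario 1
    }

    private void reset() {
        for (int i = 0; i < eventBuffer.size(); i++) {
            eventBuffer.set(i, null);
        }
    }

    public static void main(String[] args) {
        ToyCoordinator coordinator = new ToyCoordinator(2, false, false);
        coordinator.handleWriteMetadataEvent(0, List.of("status-a"));
        coordinator.handleWriteMetadataEvent(1, List.of());
        System.out.println(coordinator.notifyCheckpointComplete(1L));
        coordinator.handleWriteMetadataEvent(0, List.of());
        coordinator.handleWriteMetadataEvent(1, List.of());
        System.out.println(coordinator.notifyCheckpointComplete(2L));
    }
}
```

The `main` method walks through scenario 1 followed by scenario 2 with two operator tasks. Constructing the coordinator with `failCommit = true` exercises scenario 5 instead.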