[
https://issues.apache.org/jira/browse/FLINK-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen closed FLINK-16986.
--------------------------------
> Enhance the OperatorEvent handling guarantee during checkpointing.
> ------------------------------------------------------------------
>
> Key: FLINK-16986
> URL: https://issues.apache.org/jira/browse/FLINK-16986
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Reporter: Jiangjie Qin
> Assignee: Stephan Ewen
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.11.0
>
>
> When the {{CheckpointCoordinator}} takes a checkpoint, the checkpointing
> order is as follows:
> # {{CheckpointCoordinator}} triggers checkpoint on each
> {{OperatorCoordinator}}
> # Each {{OperatorCoordinator}} takes a snapshot.
> # Right after taking the snapshot, the {{CheckpointCoordinator}} sends a
> {{CHECKPOINT_FIN}} marker through the {{OperatorContext}}.
> # Once the {{OperatorContext}} sees the {{CHECKPOINT_FIN}} marker, it waits
> until all previous events are acked and then suspends the event gateway to
> the operators, buffering any further {{OperatorEvents}} sent from the
> {{OperatorCoordinator}} to the operators instead of sending them out.
> # The {{CheckpointCoordinator}} waits until all the {{OperatorCoordinator}}s
> finish steps 2-4 and then triggers the task snapshots.
> # The suspension of an event gateway to an operator can be lifted after all
> the subtasks of that operator have finished their task checkpoint.
> The mechanism above guarantees all the {{OperatorEvents}} sent before taking
> the operator coordinator snapshot are handled by the operator before the task
> snapshots are taken.
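
The gateway suspension in steps 4 and 6 can be sketched roughly as below. This is only a minimal illustration, not Flink's actual implementation: the class {{SuspendableEventGateway}} and its methods are hypothetical names, events are plain strings, and a list stands in for the operator. The wait for outstanding acks in step 4 is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of an event gateway that buffers events while suspended. */
final class SuspendableEventGateway {
    // Stands in for the operator: events that were actually sent out.
    private final List<String> delivered = new ArrayList<>();
    // Events held back while the gateway is suspended.
    private final Queue<String> buffered = new ArrayDeque<>();
    private boolean suspended;

    /** Step 4: after CHECKPOINT_FIN is seen, stop forwarding events. */
    void suspend() {
        suspended = true;
    }

    /** Forwards the event, or buffers it if the gateway is suspended. */
    void send(String event) {
        if (suspended) {
            buffered.add(event);
        } else {
            delivered.add(event);
        }
    }

    /** Step 6: lift the suspension and flush buffered events in their original order. */
    void resume() {
        suspended = false;
        while (!buffered.isEmpty()) {
            delivered.add(buffered.poll());
        }
    }

    List<String> delivered() {
        return delivered;
    }
}
```

In this sketch an event sent between {{suspend()}} and {{resume()}} reaches the operator only after {{resume()}}, so it cannot be handled before the task snapshot.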
> An operator can use this mechanism to know whether an {{OperatorEvent}} it
> sent to the coordinator is included in the upcoming checkpoint or not. It
> only has to ask the operator coordinator to ACK that {{OperatorEvent}}. If
> the ACK is received before the operator takes the next snapshot, that
> {{OperatorEvent}} must have been handled and checkpointed by the
> {{OperatorCoordinator}}.
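
One possible way for an operator to use the ACK rule is sketched below. The class {{AckTrackingOperator}}, the numeric event ids, and the idea of keeping un-acked events in the operator's own snapshot are all assumptions made for illustration; none of them are taken from the Flink code base.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical operator-side bookkeeping for events awaiting an ACK. */
final class AckTrackingOperator {
    // Ids of events sent to the coordinator that have not been acked yet.
    private final Set<Long> unacked = new LinkedHashSet<>();
    private long nextId;

    /** Records an event as sent to the coordinator and returns its id. */
    long sendToCoordinator() {
        long id = nextId++;
        unacked.add(id);
        return id;
    }

    /** Called when the coordinator's ACK for the given event arrives. */
    void onAck(long id) {
        unacked.remove(id);
    }

    /**
     * Called when the operator takes its snapshot. Events acked before this
     * point are covered by the coordinator's checkpoint; the remaining ones
     * may not be, so they are returned to be stored with the operator state.
     */
    List<Long> snapshotPendingEvents() {
        return new ArrayList<>(unacked);
    }
}
```

For example, if events 0 and 1 are sent and only the ACK for 0 arrives before the snapshot, {{snapshotPendingEvents()}} returns just event 1.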