[
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048870#comment-17048870
]
Stephan Ewen commented on FLINK-16357:
--------------------------------------
Yes, {{OperatorCoordinator#resetToCheckpoint(...)}} is expected to be invoked
in {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}}, iff a
failure/recovery came from {{ExecutionGraph.failGlobal()}} or
{{SchedulerNG.handleGlobalFailure()}}.
Currently, if we called {{OperatorCoordinator#resetToCheckpoint(...)}}
within {{CheckpointCoordinator#restoreLatestCheckpointedState(...)}}, we would
also reset the coordinator on every regional failover, if I read the code correctly.
The {{OperatorCoordinator}} exists once per {{ExecutionJobVertex}}, not once
per {{ExecutionVertex}}.
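To make the intended distinction concrete, here is a minimal sketch. Everything in it is hypothetical: {{RestoreScope}}, {{RecordingCoordinator}} and {{taskReset(...)}} are illustrative stand-ins, not existing Flink types or signatures, and the real {{CheckpointCoordinator}} does far more than this. It only shows the branching: reset the coordinator on a global restore, and merely notify it about restarted tasks on a regional one.

```java
import java.util.ArrayList;
import java.util.List;

public class RestoreScopeSketch {

    /** Hypothetical flag the ExecutionGraph would forward; not an existing Flink type. */
    enum RestoreScope { GLOBAL, REGIONAL }

    /** Simplified stand-in for an OperatorCoordinator that records what it was asked to do. */
    static final class RecordingCoordinator {
        final List<String> calls = new ArrayList<>();

        // Stand-in for resetToCheckpoint(...): discards internal state in favor of the checkpoint.
        void resetToCheckpoint(byte[] checkpointData) {
            calls.add("reset");
        }

        // Hypothetical notification hook: the coordinator keeps its state and adjusts it.
        void taskReset(int subtaskIndex) {
            calls.add("taskReset:" + subtaskIndex);
        }
    }

    /** Sketch of a restore path that distinguishes the two failover kinds. */
    static void restoreLatestCheckpointedState(
            RestoreScope scope,
            RecordingCoordinator coordinator,
            byte[] coordinatorState,
            List<Integer> restartedSubtasks) {
        if (scope == RestoreScope.GLOBAL) {
            // Global failover: the coordinator itself goes back to the checkpoint.
            coordinator.resetToCheckpoint(coordinatorState);
        } else {
            // Regional failover: no reset, only notifications for the restarted tasks.
            for (int subtask : restartedSubtasks) {
                coordinator.taskReset(subtask);
            }
        }
    }

    public static void main(String[] args) {
        RecordingCoordinator global = new RecordingCoordinator();
        restoreLatestCheckpointedState(RestoreScope.GLOBAL, global, new byte[0], List.of(0, 1, 2));
        System.out.println(global.calls); // [reset]

        RecordingCoordinator regional = new RecordingCoordinator();
        restoreLatestCheckpointedState(RestoreScope.REGIONAL, regional, new byte[0], List.of(1));
        System.out.println(regional.calls); // [taskReset:1]
    }
}
```

Passing the scope explicitly is just one possible shape; two separate restore methods would express the same distinction.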
> Extend Checkpoint Coordinator to differentiate between "regional restore" and
> "full restore".
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-16357
> URL: https://issues.apache.org/jira/browse/FLINK-16357
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Reporter: Stephan Ewen
> Priority: Major
> Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire
> execution graph) and "regional failure" (recover a region with transient
> pipelined data exchanges).
> The latter is for common failover; the former is a safety net to
> handle unexpected failures or inconsistencies (a full reset of the
> {{ExecutionGraph}} recovers from most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global
> failover" case. In the "regional failover" case, they are only notified of
> the tasks that are reset; they keep their internal state and adjust it for
> the failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about
> whether we are restoring from a "regional failure" or from a "global failure".