[
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen reassigned FLINK-16357:
------------------------------------
Assignee: Stephan Ewen
> Extend Checkpoint Coordinator to differentiate between "regional restore" and
> "full restore".
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-16357
> URL: https://issues.apache.org/jira/browse/FLINK-16357
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Reporter: Stephan Ewen
> Assignee: Stephan Ewen
> Priority: Major
> Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire
> execution graph) and "regional failure" (recovering a region with transient
> pipelined data exchanges).
> Regional failure is the common failover path. Global failure is a safety net
> for unexpected failures or inconsistencies (a full reset of the
> ExecutionGraph recovers from most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global
> failover" case. In the "regional failover" case, they are only notified of
> the tasks that are reset; they keep their internal state and adjust it for
> the failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about
> whether we are restoring from a "regional failure" or from a "global failure".
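
The distinction described above can be illustrated with a minimal Java sketch. All names here (RestoreModeSketch, RestoreMode, SketchCoordinator, restore) are hypothetical and chosen only for illustration; they are not Flink's actual classes or method signatures, and the integer "state" stands in for whatever a real coordinator would keep.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names are Flink's actual API.
public class RestoreModeSketch {

    /** Which kind of failover triggered the restore (assumed enum). */
    public enum RestoreMode { GLOBAL, REGIONAL }

    /** Hypothetical stand-in for an operator coordinator. */
    public static class SketchCoordinator {
        private int state; // placeholder for the coordinator's internal state
        private final Set<Integer> resetSubtasks = new HashSet<>();

        public void setState(int state) { this.state = state; }
        public int getState() { return state; }
        public Set<Integer> getResetSubtasks() { return resetSubtasks; }

        /** Global failover: discard current state and adopt the checkpointed one. */
        public void resetToCheckpoint(int checkpointedState) {
            this.state = checkpointedState;
            this.resetSubtasks.clear();
        }

        /** Regional failover: keep state, only record which subtask was reset. */
        public void subtaskReset(int subtaskIndex) {
            this.resetSubtasks.add(subtaskIndex);
        }
    }

    /** Dispatches on the restore mode forwarded by the caller. */
    public static void restore(
            SketchCoordinator coordinator,
            RestoreMode mode,
            int checkpointedState,
            Set<Integer> failedSubtasks) {
        if (mode == RestoreMode.GLOBAL) {
            coordinator.resetToCheckpoint(checkpointedState);
        } else {
            for (int subtask : failedSubtasks) {
                coordinator.subtaskReset(subtask);
            }
        }
    }

    public static void main(String[] args) {
        SketchCoordinator coordinator = new SketchCoordinator();
        coordinator.setState(7);

        Set<Integer> failed = new HashSet<>();
        failed.add(1);
        failed.add(2);

        // Regional restore: internal state is kept, failed subtasks are noted.
        restore(coordinator, RestoreMode.REGIONAL, 3, failed);
        System.out.println("after regional: state=" + coordinator.getState());

        // Global restore: state is replaced by the checkpointed value.
        restore(coordinator, RestoreMode.GLOBAL, 3, new HashSet<>());
        System.out.println("after global: state=" + coordinator.getState());
    }
}
```

The point of the sketch is only that the restore mode has to reach the code that drives the coordinators; how that information is actually forwarded is left to the implementation of this sub-task.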