[jira] [Assigned] (FLINK-16357) Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".

Stephan Ewen (Jira) Thu, 14 May 2020 01:47:41 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-16357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stephan Ewen reassigned FLINK-16357:
------------------------------------

    Assignee: Stephan Ewen

> Extend Checkpoint Coordinator to differentiate between "regional restore" and 
> "full restore".
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-16357
>                 URL: https://issues.apache.org/jira/browse/FLINK-16357
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Major
>             Fix For: 1.11.0
>
>
> The {{ExecutionGraph}} has the notion of "global failure" (failing the entire 
> execution graph) and "regional failure" (recover a region with transient 
> pipelined data exchanges).
> The latter one is for common failover, the former one is a safety net to 
> handle unexpected failures or inconsistencies (full reset of ExecutionGraph 
> recovers most inconsistencies).
> The OperatorCoordinators should only be reset to a checkpoint in the "global 
> failover" case. In the "regional failover" case, they are only notified of 
> the tasks that are reset and keep their internal state and adjust it for the 
> failed tasks.
> To implement that, the ExecutionGraph needs to forward the information about 
> whether we are restoring from a "regional failure" or from a "global failure".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (FLINK-16357) Extend Checkpoint Coordinator to differentiate between "regional restore" and "full restore".

Reply via email to