[jira] [Commented] (FLINK-4911) Non-disruptive JobManager Failures via Reconciliation

Zhijiang Wang (JIRA) Tue, 25 Oct 2016 21:06:06 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607336#comment-15607336
 ]


Zhijiang Wang commented on FLINK-4911:
--------------------------------------

FLINK-3539 tries to realize eventual consistency between JM and TMs in regular 
running. 
In JM failure scenario, the new JM wants to recover from the initial state 
based on all running tasks. 
Maybe we can refer to the running task informations from heartbeat messages and 
apply these infos for registration process.

> Non-disruptive JobManager Failures via Reconciliation 
> ------------------------------------------------------
>
>                 Key: FLINK-4911
>                 URL: https://issues.apache.org/jira/browse/FLINK-4911
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: Stephan Ewen
>
> JobManager failures can be handled in a non-disruptive way - by *reconciling* 
> the new JobManager leader and the TaskManagers.
> I suggest to use this term (reconcile)  - it has been uses also by other 
> frameworks (like Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
>   - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to 
> reconnect to the JobManager
>   - On connect, the TaskManager tells the JobManager about its currently 
> running tasks
>   - A new JobManager waits for TaskManagers to connect and report a task 
> status. It re-constructs the ExecutionGraph state from these reports
>   - Tasks whose status was not reconstructed in a certain time are assumed 
> failed and trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager 
> approach in *flip-6*, I suggest to directly implement this into the 
> {{flip-6}} feature branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-4911) Non-disruptive JobManager Failures via Reconciliation

Reply via email to