[
https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607336#comment-15607336
]
Zhijiang Wang commented on FLINK-4911:
--------------------------------------
FLINK-3539 tries to realize eventual consistency between JM and TMs in regular
running.
In JM failure scenario, the new JM wants to recover from the initial state
based on all running tasks.
Maybe we can refer to the running task informations from heartbeat messages and
apply these infos for registration process.
> Non-disruptive JobManager Failures via Reconciliation
> ------------------------------------------------------
>
> Key: FLINK-4911
> URL: https://issues.apache.org/jira/browse/FLINK-4911
> Project: Flink
> Issue Type: New Feature
> Components: JobManager
> Reporter: Stephan Ewen
>
> JobManager failures can be handled in a non-disruptive way - by *reconciling*
> the new JobManager leader and the TaskManagers.
> I suggest to use this term (reconcile) - it has been uses also by other
> frameworks (like Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
> - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to
> reconnect to the JobManager
> - On connect, the TaskManager tells the JobManager about its currently
> running tasks
> - A new JobManager waits for TaskManagers to connect and report a task
> status. It re-constructs the ExecutionGraph state from these reports
> - Tasks whose status was not reconstructed in a certain time are assumed
> failed and trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager
> approach in *flip-6*, I suggest to directly implement this into the
> {{flip-6}} feature branch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)