[ https://issues.apache.org/jira/browse/FLINK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103436#comment-15103436 ]
Eron Wright commented on FLINK-2287:
-------------------------------------
A nice follow-on here would be to improve client discovery of the JobManager.
For example, the flink CLI could accept a list of JobManager endpoints or a ZK
endpoint with which to discover the active JobManager.
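
The endpoint-list variant could look roughly like the sketch below. It is only an
illustration of the idea, not existing Flink client code: the function
find_active_jobmanager, the is_leader probe, and the host names are all
hypothetical, and how a real client would ask a JobManager whether it is the
leader is left open.

```python
from typing import Callable, Iterable, Optional


def find_active_jobmanager(
    endpoints: Iterable[str],
    is_leader: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose probe reports it as leader, else None.

    `is_leader` is a hypothetical probe supplied by the caller; it is not a
    Flink API.
    """
    for endpoint in endpoints:
        try:
            if is_leader(endpoint):
                return endpoint
        except OSError:
            # Unreachable endpoint: skip it and try the next candidate.
            continue
    return None


def fake_probe(endpoint: str) -> bool:
    """Stand-in probe for illustration: jm1 is down, jm3 is the leader."""
    if endpoint == "jm1:6123":
        raise ConnectionRefusedError("jm1 is down")
    return endpoint == "jm3:6123"


active = find_active_jobmanager(["jm1:6123", "jm2:6123", "jm3:6123"], fake_probe)
print(active)
```

A ZK-based variant would replace the loop with a single lookup of the current
leader, at the cost of requiring the client to reach ZooKeeper.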
In the meantime, what is the behavior of the client when it connects to a
standby JobManager?
> Implement JobManager high availability
> --------------------------------------
>
> Key: FLINK-2287
> URL: https://issues.apache.org/jira/browse/FLINK-2287
> Project: Flink
> Issue Type: Improvement
> Components: JobManager, TaskManager
> Reporter: Ufuk Celebi
> Fix For: 0.10.0
>
>
> The problem: The JobManager (JM) is a single point of failure. When it
> crashes, TaskManagers (TM) fail all running jobs and try to reconnect to the
> same JM. A failed JM loses all state and cannot resume the running jobs,
> even if it recovers and the TMs reconnect.
> Solution: implement JM fault tolerance/high availability by having multiple
> JM instances running with one as leader and the other(s) in standby. The
> exact coordination and state update protocol between JM, TM, and clients is
> covered in sub-tasks/issues.
> Related Wiki:
> https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability