Ufuk Celebi created FLINK-2287:
----------------------------------
Summary: Implement JobManager high availability
Key: FLINK-2287
URL: https://issues.apache.org/jira/browse/FLINK-2287
Project: Flink
Issue Type: Improvement
Components: JobManager, TaskManager
Reporter: Ufuk Celebi
Fix For: 0.10
The problem: The JobManager (JM) is a single point of failure. When it crashes,
TaskManagers (TM) fail all running jobs and try to reconnect to the same JM. A
failed JM loses all state and cannot resume the running jobs, even if it
recovers and the TMs reconnect.
Solution: Implement JM fault tolerance/high availability by running multiple JM
instances, with one as leader and the other(s) in standby. The exact
coordination and state update protocol between JM, TM, and clients is covered
in the sub-tasks of this issue.
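
To illustrate the leader/standby idea only, here is a minimal sketch. It is not
the protocol proposed in the sub-tasks: the Coordinator class is a hypothetical
in-memory stand-in for an external coordination service, the JobManager class
and the ids "jm-a" and "jm-b" are made up for the example, and "oldest
registered candidate leads" is an assumed election rule.

```python
class Coordinator:
    """Hypothetical in-memory stand-in for an external coordination service."""

    def __init__(self):
        # Registration order decides leadership (assumed rule for this sketch).
        self._candidates = []

    def register(self, jm_id):
        if jm_id not in self._candidates:
            self._candidates.append(jm_id)

    def unregister(self, jm_id):
        if jm_id in self._candidates:
            self._candidates.remove(jm_id)

    def leader(self):
        # The oldest live candidate leads; None if no JM is alive.
        return self._candidates[0] if self._candidates else None


class JobManager:
    """Hypothetical JM that is either the leader or a standby."""

    def __init__(self, jm_id, coordinator):
        self.jm_id = jm_id
        self.coordinator = coordinator
        coordinator.register(jm_id)

    def is_leader(self):
        return self.coordinator.leader() == self.jm_id

    def crash(self):
        # A crashed JM drops out of the candidate list, so a standby takes over.
        self.coordinator.unregister(self.jm_id)


def demo():
    coordinator = Coordinator()
    jm_a = JobManager("jm-a", coordinator)  # registers first, becomes leader
    jm_b = JobManager("jm-b", coordinator)  # standby
    before = coordinator.leader()
    jm_a.crash()
    after = coordinator.leader()
    return before, after, jm_b.is_leader()
```

In this sketch, TMs and clients would ask the coordinator for the current
leader instead of reconnecting to a fixed JM address. Recovering the job state
on the new leader is a separate concern and is not modeled here.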
Related Wiki:
https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability