[jira] [Created] (FLINK-2287) Implement JobManager high availability

Ufuk Celebi (JIRA) Mon, 29 Jun 2015 01:53:10 -0700

Ufuk Celebi created FLINK-2287:
----------------------------------

             Summary: Implement JobManager high availability
                 Key: FLINK-2287
                 URL: https://issues.apache.org/jira/browse/FLINK-2287
             Project: Flink
          Issue Type: Improvement
          Components: JobManager, TaskManager
            Reporter: Ufuk Celebi
             Fix For: 0.10



The problem: The JobManager (JM) is a single point of failure. When it crashes, 
TaskManagers (TM) fail all running jobs and try to reconnect to the same JM. A 
failed JM loses all state and cannot resume the running jobs, even if it 
recovers and the TMs reconnect.

Solution: Implement JM fault tolerance/high availability by running multiple JM 
instances, with one as leader and the other(s) on standby. The exact 
coordination and state update protocol between JM, TM, and clients is covered 
in sub-tasks/issues.
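
To make the leader/standby idea concrete, here is a minimal in-memory sketch. 
It is only an illustration, not the proposed design: the LeaderRegistry class, 
its methods, and the "jm-N" ids are hypothetical, and the rule "oldest 
registered candidate leads" is an assumption. It also ignores state recovery, 
which the sub-tasks are meant to address.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LeaderRegistry:
    """Hypothetical stand-in for whatever service coordinates the JMs."""

    # Candidate JM ids in registration order; index 0 is the leader.
    candidates: List[str] = field(default_factory=list)

    def register(self, jm_id: str) -> None:
        # A newly started JM joins as a candidate (standby unless it is first).
        if jm_id not in self.candidates:
            self.candidates.append(jm_id)

    def fail(self, jm_id: str) -> None:
        # A crashed JM drops out; if it was the leader, the next
        # standby in line implicitly becomes the new leader.
        if jm_id in self.candidates:
            self.candidates.remove(jm_id)

    def leader(self) -> Optional[str]:
        # TMs and clients would ask this instead of hard-coding one JM.
        return self.candidates[0] if self.candidates else None

    def standbys(self) -> List[str]:
        return self.candidates[1:]


registry = LeaderRegistry()
for jm in ("jm-1", "jm-2", "jm-3"):
    registry.register(jm)

leader_before = registry.leader()
registry.fail("jm-1")          # simulate a crash of the current leader
leader_after = registry.leader()
print(leader_before, "->", leader_after)
```

The point of the sketch is that TMs look up the current leader rather than 
reconnecting to a fixed JM, so a leader crash only changes the lookup result.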

Related Wiki: 
https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
