[jira] [Updated] (FLINK-4911) Non-disruptive JobManager Failures

Stephan Ewen (JIRA) Tue, 25 Oct 2016 07:50:22 -0700

     [ https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Stephan Ewen updated FLINK-4911:
--------------------------------
    Description: 
JobManager failures can be handled in a non-disruptive way - by *reconciling* 
the new JobManager leader and the TaskManagers.
I suggest using this term (reconcile) - it has also been used by other 
frameworks (like Mesos) for non-disruptive handling of failures.

The basic approach is the following:

  - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to 
reconnect to the JobManager
  - On connect, the TaskManager tells the JobManager about its currently 
running tasks
  - A new JobManager waits for TaskManagers to connect and report a task 
status. It re-constructs the ExecutionGraph state from these reports
  - Tasks whose status is not reconstructed within a certain timeout are 
assumed to have failed and trigger regular task recovery.
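
As a rough illustration of the steps above, here is a minimal Java sketch of the 
bookkeeping a new JobManager leader might do while reconciling. It is not taken 
from Flink: the class {{ReconcileSketch}}, its methods, the {{TaskState}} enum 
and the string task IDs are all hypothetical names chosen for this example, and 
the real ExecutionGraph carries far more state than a single enum per task.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the reconcile phase; none of these names are Flink APIs. */
final class ReconcileSketch {

    enum TaskState { UNKNOWN, RUNNING, FAILED }

    private final Map<String, TaskState> tasks = new HashMap<>();
    private final long deadlineMillis;

    /** A new leader starts with every expected task in state UNKNOWN. */
    ReconcileSketch(Collection<String> expectedTaskIds, long nowMillis, long timeoutMillis) {
        for (String id : expectedTaskIds) {
            tasks.put(id, TaskState.UNKNOWN);
        }
        this.deadlineMillis = nowMillis + timeoutMillis;
    }

    /** Called when a TaskManager reconnects and reports its currently running tasks. */
    void onTaskManagerReport(Collection<String> runningTaskIds) {
        for (String id : runningTaskIds) {
            // Only tasks that are expected and still UNKNOWN become RUNNING;
            // unexpected IDs and tasks already marked FAILED are left alone.
            tasks.replace(id, TaskState.UNKNOWN, TaskState.RUNNING);
        }
    }

    /**
     * Once the deadline has passed, marks every task that was never reported
     * as FAILED and returns those task IDs so they can go through recovery.
     * Before the deadline it changes nothing and returns an empty set.
     */
    Set<String> finishReconciliation(long nowMillis) {
        Set<String> failed = new TreeSet<>();
        if (nowMillis < deadlineMillis) {
            return failed;
        }
        for (Map.Entry<String, TaskState> entry : tasks.entrySet()) {
            if (entry.getValue() == TaskState.UNKNOWN) {
                entry.setValue(TaskState.FAILED);
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    /** Returns the reconstructed state, or null for a task this job does not know. */
    TaskState stateOf(String taskId) {
        return tasks.get(taskId);
    }

    public static void main(String[] args) {
        ReconcileSketch jm =
                new ReconcileSketch(Arrays.asList("map-1", "map-2", "sink-1"), 0L, 10_000L);
        jm.onTaskManagerReport(Arrays.asList("map-1", "sink-1"));
        System.out.println("Tasks to recover: " + jm.finishReconciliation(10_000L));
    }
}
```

In the {{main}} example, {{map-2}} is never reported by any TaskManager, so it is 
the only task handed to regular recovery once the timeout expires. Time is passed 
in explicitly rather than read from a clock purely to keep the sketch easy to test.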

To avoid having to re-implement this for the new JobManager / TaskManager 
approach in *flip-6*, I suggest implementing this directly in the {{flip-6}} 
feature branch.

  was:
JobManager failures can be handled in a non-disruptive way.

The basic approach is the following:

  - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to 
reconnect to the JobManager
  - On connect, the TaskManager tells the JobManager about its currently 
running tasks
  - A new JobManager waits for TaskManagers to connect and report a task 
status. It re-constructs the ExecutionGraph state from these reports
  - Tasks whose status was not reconstructed in a certain time are assumed 
failed and trigger regular task recovery.

To avoid having to re-implement this for the new JobManager / TaskManager 
approach in **flip-6**, I suggest to directly implement this into the 
{{flip-6}} feature branch.


> Non-disruptive JobManager Failures
> ----------------------------------
>
>                 Key: FLINK-4911
>                 URL: https://issues.apache.org/jira/browse/FLINK-4911
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: Stephan Ewen
>
> JobManager failures can be handled in a non-disruptive way - by *reconciling* 
> the new JobManager leader and the TaskManagers.
> I suggest using this term (reconcile) - it has also been used by other 
> frameworks (like Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
>   - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to 
> reconnect to the JobManager
>   - On connect, the TaskManager tells the JobManager about its currently 
> running tasks
>   - A new JobManager waits for TaskManagers to connect and report a task 
> status. It re-constructs the ExecutionGraph state from these reports
>   - Tasks whose status is not reconstructed within a certain timeout are 
> assumed to have failed and trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager 
> approach in *flip-6*, I suggest implementing this directly in the 
> {{flip-6}} feature branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
