[
https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephan Ewen updated FLINK-4911:
--------------------------------
Description:
JobManager failures can be handled in a non-disruptive way - by *reconciling*
the new JobManager leader and the TaskManagers.
I suggest to use this term (reconcile) - it has been uses also by other
frameworks (like Mesos) for non-disruptive handling of failures.
The basic approach is the following:
- When a JobManager fails, TaskManagers do not cancel tasks, but attempt to
reconnect to the JobManager
- On connect, the TaskManager tells the JobManager about its currently
running tasks
- A new JobManager waits for TaskManagers to connect and report a task
status. It re-constructs the ExecutionGraph state from these reports
- Tasks whose status was not reconstructed in a certain time are assumed
failed and trigger regular task recovery.
To avoid having to re-implement this for the new JobManager / TaskManager
approach in *flip-6*, I suggest to directly implement this into the {{flip-6}}
feature branch.
was:
JobManager failures can be handled in a non-disruptive way.
The basic approach is the following:
- When a JobManager fails, TaskManagers do not cancel tasks, but attempt to
reconnect to the JobManager
- On connect, the TaskManager tells the JobManager about its currently
running tasks
- A new JobManager waits for TaskManagers to connect and report a task
status. It re-constructs the ExecutionGraph state from these reports
- Tasks whose status was not reconstructed in a certain time are assumed
failed and trigger regular task recovery.
To avoid having to re-implement this for the new JobManager / TaskManager
approach in **flip-6**, I suggest to directly implement this into the
{{flip-6}} feature branch.
> Non-disruptive JobManager Failures
> ----------------------------------
>
> Key: FLINK-4911
> URL: https://issues.apache.org/jira/browse/FLINK-4911
> Project: Flink
> Issue Type: New Feature
> Components: JobManager
> Reporter: Stephan Ewen
>
> JobManager failures can be handled in a non-disruptive way - by *reconciling*
> the new JobManager leader and the TaskManagers.
> I suggest to use this term (reconcile) - it has been uses also by other
> frameworks (like Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
> - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to
> reconnect to the JobManager
> - On connect, the TaskManager tells the JobManager about its currently
> running tasks
> - A new JobManager waits for TaskManagers to connect and report a task
> status. It re-constructs the ExecutionGraph state from these reports
> - Tasks whose status was not reconstructed in a certain time are assumed
> failed and trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager
> approach in *flip-6*, I suggest to directly implement this into the
> {{flip-6}} feature branch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)