GitHub user StefanRRichter opened a pull request: https://github.com/apache/flink/pull/5403
[WIP] Reschedule failed tasks to previous allocation (if possible).

This PR is a preview for early feedback on scheduler changes that make allocations sticky under failures, in support of local recovery. The core idea is that scheduling takes previous allocations into account, and all tasks obey the following rule: if a task had a previous allocation, it tries to get the same slot again; failing that, it requests a new slot that is not the previous allocation of any other task. This prevents a task that cannot find its previous slot (e.g. after a machine failure) from stealing the previous slot of another failed task that could otherwise recover locally.

`SlotProfile` now specifies all requirements for a slot, and a matcher is used to identify the right candidate.

CC @tillrohrmann

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StefanRRichter/flink task-local-recovery-scheduler-wip

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5403.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #5403

----
commit 4dd57b17f24e7611ba0b68f90a6c1593cd496225
Author: Stefan Richter <s.richter@...>
Date: 2018-02-01T15:02:28Z

[WIP] Reschedule failed tasks to previous allocation (if possible).

----
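For readers who want the scheduling rule from the description in concrete form, here is a minimal Java sketch. It is not the code in this PR and does not use Flink's `SlotProfile` or matcher API; the class `StickySlotSketch`, the method `chooseSlot`, and the use of plain strings as slot identifiers are all hypothetical. It also assumes that a task without a previous allocation follows the same exclusion rule, which the description does not state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the sticky-allocation rule; not Flink's API. */
class StickySlotSketch {

    /**
     * Chooses a slot for one task that is being rescheduled.
     *
     * @param previousAllocation slot the task held before the failure, or null
     * @param freeSlots slots that are currently available
     * @param previousSlotsOfOthers previous slots of all other tasks being rescheduled
     * @return the chosen slot, or empty if a brand-new slot has to be requested
     */
    static Optional<String> chooseSlot(
            String previousAllocation,
            Set<String> freeSlots,
            Set<String> previousSlotsOfOthers) {

        // Step 1: prefer the task's own previous slot, so local state can be reused.
        if (previousAllocation != null && freeSlots.contains(previousAllocation)) {
            return Optional.of(previousAllocation);
        }

        // Step 2: otherwise accept only a slot that no other task wants back.
        for (String slot : freeSlots) {
            if (!previousSlotsOfOthers.contains(slot)) {
                return Optional.of(slot);
            }
        }

        // Step 3: nothing suitable is free, so the caller must request a new slot.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Task A previously ran in slot-1 (its machine failed); task B ran in slot-2.
        Set<String> free = new TreeSet<>(Arrays.asList("slot-2", "slot-3"));

        Optional<String> forA = chooseSlot("slot-1", free, Collections.singleton("slot-2"));
        Optional<String> forB = chooseSlot("slot-2", free, Collections.singleton("slot-1"));

        System.out.println("A -> " + forA.orElse("request new slot"));
        System.out.println("B -> " + forB.orElse("request new slot"));
    }
}
```

In this example task A skips slot-2 even though it is free, because slot-2 is the previous allocation of task B. Task B can therefore return to slot-2 and recover locally.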