GitHub user StefanRRichter opened a pull request:

    https://github.com/apache/flink/pull/5403

    [WIP] Reschedule failed tasks to previous allocation (if possible).

    This PR is a preview for early feedback on scheduler changes that make 
allocations sticky under failures for local recovery.
    
    The core idea is to take previous allocations into account when
    scheduling, with all tasks obeying the following rule:
    
    If there was a previous allocation, try to find the same slot again;
    otherwise request a new slot that is not claimed by another task as its
    previous allocation. This prevents tasks that cannot find their previous
    slot (e.g. after a machine failure) from stealing the previous slot of
    another failed task that could otherwise recover locally.
    
    `SlotProfile` now specifies all requirements for a slot, and a matcher
    is used to identify the right candidate.
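
A minimal sketch of the rule described above, for illustration only. The class `StickySlotSketch`, the method `pickSlot`, and the use of plain string slot ids are hypothetical and are not the classes or signatures in this PR. The sketch also assumes that a task without a previous allocation is subject to the same exclusion, which the description does not state.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical illustration of the sticky-allocation rule; not Flink's actual API. */
final class StickySlotSketch {

    /**
     * Chooses a slot for a task that is being rescheduled.
     *
     * @param previousAllocation id of the slot the task held before, or null if none
     * @param freeSlots          ids of the currently free slots, in preference order
     * @param claimedByOthers    previous allocations of the other tasks being rescheduled
     * @return the chosen slot id, or empty if a fresh slot has to be requested
     */
    static Optional<String> pickSlot(
            String previousAllocation,
            List<String> freeSlots,
            Set<String> claimedByOthers) {

        // Step 1: prefer the previous slot, so the task can recover locally.
        if (previousAllocation != null && freeSlots.contains(previousAllocation)) {
            return Optional.of(previousAllocation);
        }

        // Step 2: fall back to a free slot that no other task claims as its
        // previous allocation, so this task does not steal local state.
        // (Assumption: tasks with no previous allocation follow the same rule.)
        for (String slot : freeSlots) {
            if (!claimedByOthers.contains(slot)) {
                return Optional.of(slot);
            }
        }

        // No suitable free slot: the caller would request a new one.
        return Optional.empty();
    }
}
```

With free slots `a`, `b`, `c` and `a` claimed by another task, a task whose previous slot was lost is given `b` rather than `a`, while the task that previously held `a` still gets `a` back.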
    
    CC @tillrohrmann 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StefanRRichter/flink task-local-recovery-scheduler-wip

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5403.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5403
    
----
commit 4dd57b17f24e7611ba0b68f90a6c1593cd496225
Author: Stefan Richter <s.richter@...>
Date:   2018-02-01T15:02:28Z

    [WIP] Reschedule failed tasks to previous allocation (if possible).

----

