[ 
https://issues.apache.org/jira/browse/FLINK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277385#comment-14277385
 ] 

ASF GitHub Bot commented on FLINK-1376:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/309

    [FLINK-1376] [runtime] Add proper shared slot release in case of a fatal 
TaskManager failure

    This PR introduces SharedSlot as a special Slot type, so that shared slots 
are released properly once an Instance has been marked dead. This fixes the 
problem that a dead instance which was not shut down properly prevents a job 
from being removed from the system, because the instance is not aware of the 
SubSlots.
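
As a rough illustration of the idea (not the code in this PR), the sketch below models a shared slot as a slot subtype whose release also releases its sub-slots, so that marking an instance dead reaches them. All class and method names here (`Slot`, `SubSlot`, `SharedSlot`, `Instance`, `register`, `allocateSubSlot`) are hypothetical stand-ins and are not taken from the Flink code base. The sketch is single-threaded and ignores locking.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SharedSlotReleaseSketch {

    /** Hypothetical base type: anything an instance hands out and can release. */
    public abstract static class Slot {
        private boolean released;

        public boolean isReleased() {
            return released;
        }

        public void release() {
            released = true;
        }
    }

    /** Hypothetical sub-slot that lives inside a shared slot. */
    public static class SubSlot extends Slot { }

    /** Hypothetical shared slot: releasing it also releases every sub-slot. */
    public static class SharedSlot extends Slot {
        private final Set<SubSlot> subSlots = new HashSet<>();

        public SubSlot allocateSubSlot() {
            SubSlot subSlot = new SubSlot();
            subSlots.add(subSlot);
            return subSlot;
        }

        @Override
        public void release() {
            // Because the shared slot knows its sub-slots, none is left behind.
            for (SubSlot subSlot : subSlots) {
                subSlot.release();
            }
            subSlots.clear();
            super.release();
        }
    }

    /** Hypothetical instance that tracks the slots it has handed out. */
    public static class Instance {
        private final List<Slot> allocated = new ArrayList<>();

        public <T extends Slot> T register(T slot) {
            allocated.add(slot);
            return slot;
        }

        public void markDead() {
            // Every slot is released through the common Slot type, so a
            // SharedSlot takes its overridden path automatically.
            for (Slot slot : allocated) {
                slot.release();
            }
            allocated.clear();
        }
    }

    public static void main(String[] args) {
        Instance instance = new Instance();
        SharedSlot shared = instance.register(new SharedSlot());
        SubSlot sub = shared.allocateSubSlot();
        instance.markDead();
        System.out.println("sub-slot released: " + sub.isReleased());
    }
}
```

The point of the sketch is only that the release path is polymorphic: the instance does not need to know which of its slots are shared.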
    
    Adds test cases in which only the TaskManager is killed by a Kill message 
(hard shutdown).
    
    @StephanEwen: Requires thorough review because it touches some delicate 
scheduling/slot logic.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixSharedSlotReleaseAkka

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #309
    
----
commit ba1dd8b2ce956eb1b14a0ca458a3ca5240da0aee
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-01-12T09:58:45Z

    [FLINK-1376] [runtime] Add proper shared slot release in case of a fatal 
TaskManager failure.
    
    Fixes a concurrent modification exception on SharedSlot's subSlots field 
by synchronizing all state-changing operations through the associated 
assignment group. Fixes a deadlock between two paths: Instance.markDead first 
acquires the InstanceLock and then, while releasing the associated slots, the 
assignment group lock, whereas a direct releaseSlot call on a SharedSlot first 
acquires the assignment group lock and then the instance lock in order to 
return the slot to the instance.
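
The deadlock described above is a classic lock-order inversion. The commit message does not say how the PR resolves it; one common remedy, shown here only as a sketch, is to make both paths take the two locks in the same fixed order. The names `groupLock`, `instanceLock`, `markDead`, `releaseSlot` and `runConcurrently` below are hypothetical stand-ins, not Flink's actual fields or methods.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    // Hypothetical stand-ins for the assignment group lock and the instance lock.
    private final ReentrantLock groupLock = new ReentrantLock();
    private final ReentrantLock instanceLock = new ReentrantLock();

    private int slotsReturned; // guarded by instanceLock

    /** Path 1: the instance is marked dead and releases its slots. */
    public void markDead() {
        returnSlotInFixedOrder();
    }

    /** Path 2: a shared slot is released directly. */
    public void releaseSlot() {
        returnSlotInFixedOrder();
    }

    // Both paths acquire groupLock first and instanceLock second, so neither
    // thread can hold one lock while waiting for the other in reverse order.
    private void returnSlotInFixedOrder() {
        groupLock.lock();
        try {
            instanceLock.lock();
            try {
                slotsReturned++;
            } finally {
                instanceLock.unlock();
            }
        } finally {
            groupLock.unlock();
        }
    }

    public int slotsReturned() {
        instanceLock.lock();
        try {
            return slotsReturned;
        } finally {
            instanceLock.unlock();
        }
    }

    /** Runs both paths concurrently; returns true if both threads finish. */
    public static boolean runConcurrently(int iterations) {
        LockOrderSketch sketch = new LockOrderSketch();
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                sketch.markDead();
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                sketch.releaseSlot();
            }
        });
        a.start();
        b.start();
        try {
            a.join(10_000);
            b.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !a.isAlive() && !b.isAlive()
                && sketch.slotsReturned() == 2 * iterations;
    }

    public static void main(String[] args) {
        System.out.println("both paths completed: " + runConcurrently(10_000));
    }
}
```

If `markDead` instead took `instanceLock` before `groupLock`, the two threads could each hold one lock and wait forever for the other, which is the situation the commit message describes.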

----


> SubSlots are not properly released in case that a TaskManager fatally fails, 
> leaving the system in a corrupted state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-1376
>                 URL: https://issues.apache.org/jira/browse/FLINK-1376
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>
> If a TaskManager fails fatally and some of the failing node's slots are 
> SharedSlots, the JobManager does not release those slots properly. As a 
> result, the corresponding job is not failed properly, leaving the system in 
> a corrupted state.
> The reason is that the AllocatedSlot is not aware that it is being treated 
> as a SharedSlot, and thus it cannot release the associated SubSlots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
