[jira] [Commented] (MAPREDUCE-3274) Race condition in MR App Master Preemtion can cause a dead lock

Robert Joseph Evans (Commented) (JIRA) Thu, 27 Oct 2011 07:40:53 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137182#comment-13137182 ]


Robert Joseph Evans commented on MAPREDUCE-3274:
------------------------------------------------

Yes, the JVM thing was a red herring.

The issue is that on the AM the events were processed in the following order:
CONTAINER_REMOTE_LAUNCH
TA_KILL

But on the NM they were processed in the reverse order:
Stop Container Request (Error)
Start Container Request (Success)

The Stop Request was processed 4 ms before the Start Request was.
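
To make the reordering concrete, here is a toy model of it. ToyNodeManager 
and its methods are made up for illustration and are not the real NM API; the 
only point is that a stop for a container the NM has not seen yet fails, and 
the later start then leaves the container running.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the NM; not the real NodeManager API. */
final class ToyNodeManager {
    private final Set<String> running = new HashSet<>();

    /** Returns true if the container was started by this call. */
    boolean start(String containerId) {
        return running.add(containerId);
    }

    /** Returns false (an error) if the container is not known to be running. */
    boolean stop(String containerId) {
        return running.remove(containerId);
    }

    boolean isRunning(String containerId) {
        return running.contains(containerId);
    }
}

final class ToyNodeManagerDemo {
    public static void main(String[] args) {
        ToyNodeManager nm = new ToyNodeManager();
        // The NM sees the requests in the opposite order from the AM.
        boolean stopOk = nm.stop("container_1");   // arrives first, fails
        boolean startOk = nm.start("container_1"); // arrives second, succeeds
        System.out.println("stop ok: " + stopOk + ", start ok: " + startOk
                + ", still running: " + nm.isRunning("container_1"));
    }
}
```

If the AM ignores the failed stop, nothing ever kills the container.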


I need to read through the code some more to understand how to handle this.  
My gut feeling is that we need a way to handle an error in a Stop Container 
Request.  We may need an event sent back indicating that the TA_KILL failed.  
Alternatively, instead of sending an event back, we could retry the stop a 
few times before giving up.
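
A rough sketch of the retry idea, assuming the stop request can be modelled 
as a call that returns true on success.  StopContainerRetry, Outcome, and 
stopWithRetry are hypothetical names, not existing Hadoop classes, and the 
attempt count and backoff are placeholders.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; none of these names exist in the Hadoop code base. */
final class StopContainerRetry {
    enum Outcome { STOPPED, KILL_FAILED }

    /**
     * Issues the stop request up to maxAttempts times, sleeping between
     * attempts. KILL_FAILED is the point where the caller could send an
     * event back saying that the TA_KILL failed.
     */
    static Outcome stopWithRetry(BooleanSupplier stopRequest,
                                 int maxAttempts,
                                 long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (stopRequest.getAsBoolean()) {
                return Outcome.STOPPED;
            }
            if (attempt < maxAttempts) {
                try {
                    // Give a delayed start request time to be processed.
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Outcome.KILL_FAILED;
                }
            }
        }
        return Outcome.KILL_FAILED;
    }
}
```

With a gap of only a few milliseconds between the two requests, a short 
backoff would probably have been enough here, but the KILL_FAILED path is 
still needed for the case where the stop never succeeds.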

Also, the container launched and started talking to the App Master, requesting 
something to do.  The App Master always responded that it had nothing for the 
container to do.  It might be good if the code that tells the Container what 
to do knew about killed attempts, so that if a killed attempt asks for 
something to do for some reason, it is told to die.  This seems like a good 
way to prevent this type of error in the future.
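
A minimal sketch of that second idea.  KilledAttemptRegistry, Reply, 
markKilled, and onTaskRequest are hypothetical names chosen for illustration; 
the real code path that answers the container is not shown here, and the 
case where the App Master actually has a task to hand out is left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical bookkeeping for killed attempts; not an existing Hadoop class. */
final class KilledAttemptRegistry {
    enum Reply { NOTHING_TO_DO, SHOULD_DIE }

    // Thread-safe set, since kills and task requests arrive on different threads.
    private final Set<String> killedAttempts = ConcurrentHashMap.newKeySet();

    /** Called when a TA_KILL is processed for the given attempt. */
    void markKilled(String attemptId) {
        killedAttempts.add(attemptId);
    }

    /** Called when a container asks the App Master for something to do. */
    Reply onTaskRequest(String attemptId) {
        return killedAttempts.contains(attemptId)
                ? Reply.SHOULD_DIE
                : Reply.NOTHING_TO_DO;
    }
}
```

With something like this in place, a container that slips past a failed 
stop request would be told to exit the first time it asks for work.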
                
> Race condition in MR App Master Preemtion can cause a dead lock
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-3274
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3274
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, scheduler
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>             Fix For: 0.23.0, 0.24.0
>
>
> There appears to be a race condition in the MR App Master in relation to 
> preempting reducers to let a mapper run.  In the particular case that I have 
> been debugging, a reducer was selected for preemption that did not have a 
> container assigned to it yet. When the container became available, that reduce 
> started running and the previous TA_KILL event appears to have been ignored.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
