[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397033#comment-13397033
 ] 

Jason Lowe commented on MAPREDUCE-4157:
---------------------------------------

It's not part of MAPREDUCE-3614; rather, it came out of the work on 
MAPREDUCE-4099.  When the AM unregisters with the RM, there's a race between 
the AM finishing normally on its own and the RM killing the AM as part of 
killing all containers for the application.  If the AM is performing cleanup 
duties that aren't critical to the success/failure of the application, then 
the RM should give the AM time to finish them before killing it as a 
side effect of the unregister.

The AM could move the cleanup to before the unregister, but if the AM 
fails/dies/hangs during the cleanup the RM will attempt to restart the AM 
thinking the job did not complete successfully even though the client has 
already been notified of the success.  If the staging directory was removed 
as part of the cleanup, the restart will fail and the RM will mark the job as 
failed even though the client was told it succeeded.

This change doesn't eliminate all of the race conditions (the AM could fail 
after the client is notified but before unregistering with the RM), but it does 
eliminate a race between the AM shutting down cleanly and the RM trying to kill 
it.
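
To make the shutdown race concrete, here is a minimal, self-contained sketch 
of the idea of giving the AM a grace period after it unregisters.  It does not 
use any YARN classes and is not the code in the attached patch: FakeAppMaster, 
simulate, and the millisecond values are all hypothetical names and numbers 
chosen for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative simulation only; none of these types exist in Hadoop.
class GracefulUnregisterSketch {

    /** Hypothetical stand-in for an AM that cleans up after unregistering. */
    static final class FakeAppMaster {
        final CountDownLatch exited = new CountDownLatch(1);
        volatile boolean cleanupFinished = false;
        private final Thread worker;

        FakeAppMaster(long cleanupMillis) {
            worker = new Thread(() -> {
                try {
                    Thread.sleep(cleanupMillis);   // non-critical cleanup work
                    cleanupFinished = true;
                } catch (InterruptedException e) {
                    // Killed before cleanup completed.
                } finally {
                    exited.countDown();            // process is gone either way
                }
            });
        }

        void unregisterAndCleanUp() { worker.start(); }

        void kill() { worker.interrupt(); }
    }

    /**
     * Simulated RM side: after the unregister, wait up to graceMillis for the
     * AM to exit on its own and only then fall back to killing it.
     * Returns true if the AM's cleanup ran to completion.
     */
    static boolean simulate(long cleanupMillis, long graceMillis) {
        FakeAppMaster am = new FakeAppMaster(cleanupMillis);
        am.unregisterAndCleanUp();
        try {
            boolean exitedOnItsOwn =
                am.exited.await(graceMillis, TimeUnit.MILLISECONDS);
            if (!exitedOnItsOwn) {
                am.kill();          // grace period expired: force the exit
                am.exited.await();  // wait for the kill to take effect
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        }
        return am.cleanupFinished;
    }

    public static void main(String[] args) {
        System.out.println("short cleanup finished: " + simulate(10, 2000));
        System.out.println("long cleanup finished:  " + simulate(5000, 50));
    }
}
```

With a grace period of zero, the sketch behaves like the situation described 
in the issue: the kill is issued as soon as the AM unregisters, so even a 
well-behaved AM may be interrupted in the middle of its cleanup.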
                
> ResourceManager should not kill apps that are well behaved
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-4157
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4157
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 2.0.0-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-4157.patch
>
>
> Currently when the ApplicationMaster unregisters with the ResourceManager, 
> the RM kills (via the NMs) all the active containers for an application.  
> This introduces a race where the AM may be trying to clean up and may not 
> finish before it is killed.  The RM should give the AM a chance to exit 
> cleanly on its own rather than always race with a pending kill on shutdown.


        
