[ https://issues.apache.org/jira/browse/MAPREDUCE-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397033#comment-13397033 ]
Jason Lowe commented on MAPREDUCE-4157:
---------------------------------------
It's not part of MAPREDUCE-3614; rather, it came out of the work on
MAPREDUCE-4099. When the AM unregisters with the RM, there's a race between
the AM finishing normally on its own and the RM killing the AM as part of
killing all containers for the application. If the AM is performing cleanup
duties that aren't critical to the success or failure of the application, it
would be nice if the AM were given time to finish them before the RM kills it
as a side effect of the unregister.
The AM could move the cleanup to before the unregister, but if the AM
fails/dies/hangs during the cleanup, the RM will attempt to restart the AM,
thinking the job did not complete successfully even though the client has
already been notified of the success. And if the staging directory was removed
as part of the cleanup, the restart will fail and the RM will mark the job as
failed even though the client was told it succeeded.
This change doesn't eliminate all of the race conditions (the AM could fail
after the client is notified but before unregistering with the RM), but it does
eliminate a race between the AM shutting down cleanly and the RM trying to kill
it.
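
To make the shutdown race concrete, here is a minimal, self-contained Java
sketch. It is not the attached patch and does not use any YARN API: the class
name, the Outcome enum, and the idea of a fixed grace period are all
assumptions chosen for illustration. A thread stands in for the AM doing
non-critical cleanup after unregistering, and the caller stands in for the RM,
which waits a bounded time for the AM to exit before falling back to a kill.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from Hadoop.
final class GracefulUnregisterSketch {

    enum Outcome { EXITED_CLEANLY, KILLED_AFTER_GRACE }

    /**
     * Simulates one unregister. The "AM" thread needs cleanupMillis to finish
     * its post-unregister cleanup; the "RM" (the calling thread) waits up to
     * graceMillis for it to exit on its own before killing it.
     */
    static Outcome simulate(long cleanupMillis, long graceMillis) {
        CountDownLatch amExited = new CountDownLatch(1);

        Thread am = new Thread(() -> {
            try {
                Thread.sleep(cleanupMillis);   // non-critical cleanup work
                amExited.countDown();          // AM exits on its own
            } catch (InterruptedException e) {
                // Interrupted mid-cleanup: this models the RM's kill.
            }
        });
        am.start();

        try {
            // Assumed behaviour: give the AM a bounded chance to finish.
            boolean exited = amExited.await(graceMillis, TimeUnit.MILLISECONDS);
            if (!exited) {
                am.interrupt();                // fall back to killing the AM
            }
            am.join();
            return exited ? Outcome.EXITED_CLEANLY : Outcome.KILLED_AFTER_GRACE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        }
    }

    public static void main(String[] args) {
        // Cleanup is well inside the grace period.
        System.out.println(simulate(50, 2000));
        // Cleanup hangs past the grace period.
        System.out.println(simulate(5000, 100));
    }
}
```

With no grace period at all (the current behaviour described in the issue),
the kill is issued as soon as the unregister is processed, so whether the
cleanup completes depends purely on timing.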
> ResourceManager should not kill apps that are well behaved
> ----------------------------------------------------------
>
> Key: MAPREDUCE-4157
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4157
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Affects Versions: 2.0.0-alpha
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: MAPREDUCE-4157.patch
>
>
> Currently when the ApplicationMaster unregisters with the ResourceManager,
> the RM kills (via the NMs) all the active containers for an application.
> This introduces a race where the AM may be trying to clean up and may not
> finish before it is killed. The RM should give the AM a chance to exit
> cleanly on its own rather than always race with a pending kill on shutdown.