[
https://issues.apache.org/jira/browse/MAPREDUCE-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Joseph Evans updated MAPREDUCE-4611:
-------------------------------------------
Attachment: MR-4611.txt
This patch changes the AM to clean up only when the job has finished or when
it is the last retry for the AM.
I have tested this manually in addition to adding unit tests.
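As a rough illustration of the rule, the decision could be sketched as below. This is not the code in MR-4611.txt: the class name, method name, and parameters are all hypothetical, and the client-kill case is taken from the issue description rather than from this comment.

```java
// Hypothetical sketch of the cleanup decision; none of these names come from the patch.
class CleanupPolicy {

    /**
     * Returns true if the AM should clean up (for example, remove the
     * .staging directory) before exiting.
     *
     * @param clientRequestedKill assumed flag: the client asked the AM to exit
     * @param jobFinished         assumed flag: the job finished and the AM is exiting anyway
     * @param attempt             assumed 1-based number of the current AM attempt
     * @param maxAttempts         assumed maximum number of AM attempts allowed
     */
    static boolean shouldCleanUp(boolean clientRequestedKill,
                                 boolean jobFinished,
                                 int attempt,
                                 int maxAttempts) {
        // No later attempt will run, so nothing else could clean up after this one.
        boolean lastRetry = attempt >= maxAttempts;
        return clientRequestedKill || jobFinished || lastRetry;
    }

    public static void main(String[] args) {
        // Kill signal on an early attempt with the job still running:
        // leave the state in place for the next AM attempt.
        System.out.println(shouldCleanUp(false, false, 1, 2));
        // Same signal on the final attempt: clean up.
        System.out.println(shouldCleanUp(false, false, 2, 2));
    }
}
```

In this sketch, a kill signal that arrives on an early attempt while the job is still running does not trigger cleanup, so a later AM attempt can still find its staging data.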
> MR AM dies badly when Node is decommissioned
> --------------------------------------------
>
> Key: MAPREDUCE-4611
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4611
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 0.23.3, 2.0.0-alpha, 3.0.0
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Attachments: MR-4611.txt
>
>
> The MR AM always assumes that it is being killed by the RM when it receives a
> kill signal before it has finished processing. In reality, the RM sends a kill
> signal only when the client cannot communicate directly with the AM, which
> probably means that the AM is already in a bad state. The much more common
> case is that the node has been marked as unhealthy or decommissioned.
> I propose that, in the short term, the AM clean up only if:
> # The process has been asked by the client to exit (kill)
> # The job has finished cleanly and the process is already exiting
> # This is the last retry of the AM
> The downside is that, in some cases, a kill from the RM will leak the .staging
> directory and the job will not show up in the history server, at least until
> the full set of AM cleanup issues can be addressed, probably as part of
> MAPREDUCE-4428.