[
https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263903#comment-13263903
]
Thomas Graves commented on MAPREDUCE-4152:
------------------------------------------
It looks like the best way to have the app master clean up any containers that
aren't completed is to do it via a service. I first investigated just having
the job send the kill event when it transitioned to the ERROR state, but the
job sends a kill event to each task, each task then has to send a kill event
to its task attempts, and each task attempt sends an event to the container
launcher to tell the node manager to kill the container. That is a lot to do,
and I don't want the job to wait for all of it since it's an error state. If
you don't wait, you have a race as to whether everything is actually
processed. The other issue with sending events is that the final jobfinish
event is handled by the same async dispatcher, so the dispatcher will be busy
finishing/shutting down and won't process any further events. So it seems it
has to be done by a service during the stop call. The container launcher
already knows which of its containers aren't complete, so I chose to have the
container launcher kill any containers that haven't completed when its stop
routine is called. The bad part is that the container launcher didn't have all
the information required to actually kill the container, so I had to add it,
which I'm not completely happy with, but it seemed the best fit.
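To make the idea concrete, here is a rough sketch of the shape described above.
It is not the patch itself: the class and method names
(SketchContainerLauncher, NodeManagerClient, launched, completed) are
hypothetical, and the node manager address stands in for whatever extra
information the launcher needs in order to kill a container.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the call that asks a node manager to kill a container. */
interface NodeManagerClient {
    void stopContainer(String containerId, String nodeAddress);
}

/** Sketch of a launcher that remembers enough about each container to kill it in stop(). */
class SketchContainerLauncher {
    // containerId -> node manager address. The address is an assumed example of the
    // extra state the launcher has to keep so it can kill a container by itself.
    private final Map<String, String> running = new ConcurrentHashMap<>();
    private final NodeManagerClient nodeManager;

    SketchContainerLauncher(NodeManagerClient nodeManager) {
        this.nodeManager = nodeManager;
    }

    /** Record a container once it has been launched. */
    void launched(String containerId, String nodeAddress) {
        running.put(containerId, nodeAddress);
    }

    /** Forget a container once it has completed normally. */
    void completed(String containerId) {
        running.remove(containerId);
    }

    /**
     * Called directly during shutdown, so it does not depend on the event
     * dispatcher still processing events. Returns the ids that were killed.
     */
    List<String> stop() {
        List<String> killed = new ArrayList<>();
        for (Map.Entry<String, String> entry : running.entrySet()) {
            // Remove before killing so a container is never killed twice.
            if (running.remove(entry.getKey(), entry.getValue())) {
                try {
                    nodeManager.stopContainer(entry.getKey(), entry.getValue());
                    killed.add(entry.getKey());
                } catch (RuntimeException e) {
                    // Best effort: keep going so one failure doesn't strand the rest.
                }
            }
        }
        return killed;
    }
}
```

The point of the sketch is that stop() walks the launcher's own bookkeeping
and talks to the node manager directly, instead of routing kill events through
job -> task -> task attempt -> launcher.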
I will attach a patch shortly.
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
> Key: MAPREDUCE-4152
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.2
> Reporter: Thomas Graves
> Assignee: Thomas Graves
> Attachments: MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour. The
> application master exited with "Could not contact RM after 360000
> milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
> job_1333003059741_15999Job Transitioned from RUNNING to ERROR