[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263903#comment-13263903
 ] 

Thomas Graves commented on MAPREDUCE-4152:
------------------------------------------

It looks like the best way to have the app master clean up any containers that 
aren't completed is to do it via a service. I first investigated just having 
the job send the kill event when it transitioned to the ERROR state, but the 
job sends a kill event to each task, each task then has to send a kill event 
to its task attempt, and the task attempt then sends an event to the container 
launcher to tell the node manager to kill the container.  That is a lot to do, 
and I don't want to have the job wait for that to happen since it's an error 
state. If you don't wait, then you have a race as to whether everything is 
actually processed. The other issue with sending events is that the final 
jobfinish event is handled by the same async dispatcher, so the dispatcher 
will be busy finishing/shutting down and won't process any further events. So 
it seems it 
has to be done by a service during the stop call. The container launcher 
already knows what containers it has that aren't complete so I chose to have 
the container launcher kill any containers that haven't completed when its stop 
routine is called.  The bad part is that the container launcher didn't have 
all the information required to actually kill the container, so I had to add 
it, which I'm not completely happy with, but it seemed the best fit.  
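
To make the approach concrete, here is a minimal sketch in Java of a launcher 
that kills its outstanding containers from its stop routine instead of relying 
on the event dispatcher. It is not the patch and does not use the real Hadoop 
classes: ContainerLauncherSketch, NodeManagerClient, launched() and completed() 
are hypothetical names, and the "information required to kill the container" 
is assumed here to be just the node manager address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the call to the node manager; not a Hadoop API.
interface NodeManagerClient {
    void stopContainer(String containerId, String nodeAddress);
}

// Hypothetical launcher that keeps enough per-container state to kill
// whatever is still running when the service is stopped.
class ContainerLauncherSketch {
    // containerId -> node manager address, for containers not yet completed.
    private final Map<String, String> outstanding = new ConcurrentHashMap<>();
    private final NodeManagerClient nodeManager;

    ContainerLauncherSketch(NodeManagerClient nodeManager) {
        this.nodeManager = nodeManager;
    }

    // Record a launched container along with what is needed to kill it later.
    void launched(String containerId, String nodeAddress) {
        outstanding.put(containerId, nodeAddress);
    }

    // Forget a container once it has completed normally.
    void completed(String containerId) {
        outstanding.remove(containerId);
    }

    // Called directly from the service stop path, so it does not depend on
    // the async dispatcher still processing events. Returns the ids killed.
    List<String> stop() {
        List<String> killed = new ArrayList<>();
        for (Map.Entry<String, String> entry : outstanding.entrySet()) {
            try {
                nodeManager.stopContainer(entry.getKey(), entry.getValue());
                killed.add(entry.getKey());
            } catch (RuntimeException e) {
                // Best effort: one failed kill should not block the rest.
            }
        }
        outstanding.clear();
        return killed;
    }

    public static void main(String[] args) {
        ContainerLauncherSketch launcher = new ContainerLauncherSketch(
            (id, addr) -> System.out.println("kill " + id + " on " + addr));
        launcher.launched("container_1", "nodeA:1234");
        launcher.launched("container_2", "nodeB:1234");
        launcher.completed("container_1");
        launcher.stop(); // only container_2 is still outstanding
    }
}
```

Doing the cleanup in stop() trades the per-task, per-attempt kill events for 
one direct pass over the launcher's own bookkeeping, which is why the launcher 
has to carry the extra per-container state.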

I will attach a patch shortly.
                
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour.  The 
> application master exited with "Could not contact RM after 360000 
> milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1333003059741_15999Job Transitioned from RUNNING to ERROR


