[jira] [Commented] (MAPREDUCE-4152) map task left hanging after AM dies trying to connect to RM

Vinod Kumar Vavilapalli (JIRA) Mon, 30 Apr 2012 12:14:11 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265125#comment-13265125
 ]


Vinod Kumar Vavilapalli commented on MAPREDUCE-4152:
----------------------------------------------------

bq. The Job did not kill off the map task that it had running before exiting. In 
JobImpl when it moves from RUNNING to ERROR, all it does is send the 
JobUnsuccessfulCompletion event. I would think it would at least try to kill any 
tasks it has.
This is reasonable.

bq. Now there might also be another issue with NM as to why it didn't kill it. 
Can you please investigate? This seems like a more important issue that needs 
fixing.

Regarding the patch:
 - TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in the RUNNING 
state. That will cause more cascading errors, which we can avoid by making it a 
legal event in the RUNNING state too.
 - kill() can and should be made reentrant in case the container was already 
killed. (There is a small race in which a container can be killed twice. This 
happens only after the patch, as stop() is out-of-band.)
 - Very minor nit: it is more natural for the Container constructor to take in 
all the info it needs, instead of the launcher event.
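
To illustrate the second point, here is a rough sketch of what a reentrant 
kill() could look like. This is not the code in the patch: SketchContainer, 
its State enum, and the stopRequests counter are hypothetical names made up 
for this example, and the real stop call to the NM is only represented by the 
counter. The idea is that the state flips to DONE atomically, so only the 
first caller does any work and a second kill() is a harmless no-op.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the launcher's per-container bookkeeping.
// It only illustrates an idempotent kill(); it is not the patch's class.
class SketchContainer {
    enum State { PREP, RUNNING, DONE }

    private final String containerId;
    private final AtomicReference<State> state =
        new AtomicReference<>(State.PREP);
    // Counts how often a real stop request would have been sent to the NM.
    private final AtomicInteger stopRequests = new AtomicInteger();

    SketchContainer(String containerId) {
        this.containerId = containerId;
    }

    // Moves PREP -> RUNNING; does nothing if the container was already killed.
    boolean launch() {
        return state.compareAndSet(State.PREP, State.RUNNING);
    }

    // Safe to call any number of times, from any thread.
    // Returns true only for the call that actually performed the kill.
    boolean kill() {
        State previous = state.getAndSet(State.DONE);
        if (previous == State.DONE) {
            return false; // already killed, nothing left to do
        }
        if (previous == State.RUNNING) {
            // Placeholder for the real "stop container" request.
            stopRequests.incrementAndGet();
        }
        return true;
    }

    int stopRequests() {
        return stopRequests.get();
    }

    String containerId() {
        return containerId;
    }
}
```

With something along these lines, the out-of-band stop() and the normal 
cleanup path can both call kill() without coordinating with each other.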
                
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour.  The 
> application master exited with "Could not contact RM after 360000 
> milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: 
> job_1333003059741_15999Job Transitioned from RUNNING to ERROR

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
