[ https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266960#comment-13266960 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-4152:
----------------------------------------------------
Going back and forth on this one, apologies.
So the situation is that the RM went down somehow and the AM exited without
killing its tasks. This is expected, IIRC. Here's what I think:
- Once RM restart works, AMs should *never* exit because of connection issues.
(Of course, there is a corner case where the AM's own network is down; we
should handle that somehow.)
- Even in the short term, if the RM goes down and the AM exits in the meantime,
then whenever the RM is back up, it will (or at least should) kill all the
containers of this application (by commanding the NMs to do so).
Given the above, I don't see why the AM needs to handle this specially. Maybe
I am missing something?
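
To make the two points above concrete, here is a rough sketch in plain Java.
It is not Hadoop code: RmRetrySketch, heartbeatWithRetry, containersToKill and
RETRY_FOREVER are hypothetical names invented for illustration, and the
BooleanSupplier merely stands in for an AM-to-RM heartbeat. The first method
contrasts a bounded wait (such as the 360000 ms in the report) with retrying
indefinitely; the second shows the cleanup a restarted RM would be expected to
request from the NMs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Illustrative only: none of these names are real Hadoop/YARN APIs.
class RmRetrySketch {
    /** Hypothetical sentinel meaning "never give up on the RM". */
    static final long RETRY_FOREVER = -1L;

    /**
     * Keeps calling contactRm until it succeeds or maxWaitMs has elapsed.
     * Returns true if the RM was reached, false if the AM gave up
     * (or the waiting thread was interrupted).
     */
    static boolean heartbeatWithRetry(BooleanSupplier contactRm,
                                      long maxWaitMs,
                                      long retryIntervalMs) {
        long waitedMs = 0;
        while (true) {
            if (contactRm.getAsBoolean()) {
                return true;
            }
            if (maxWaitMs != RETRY_FOREVER && waitedMs >= maxWaitMs) {
                return false; // bounded wait exhausted: the AM exits
            }
            try {
                Thread.sleep(retryIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            waitedMs += retryIntervalMs;
        }
    }

    /**
     * What a restarted RM would ask the NMs to kill: every container that
     * belongs to an application whose AM is no longer alive.
     */
    static List<String> containersToKill(Map<String, List<String>> containersByApp,
                                         Set<String> appsWithLiveAm) {
        List<String> toKill = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : containersByApp.entrySet()) {
            if (!appsWithLiveAm.contains(entry.getKey())) {
                toKill.addAll(entry.getValue());
            }
        }
        return toKill;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated RM that becomes reachable on the third attempt.
        boolean reached = heartbeatWithRetry(() -> ++attempts[0] >= 3, RETRY_FOREVER, 10L);
        System.out.println("reached RM: " + reached + " after " + attempts[0] + " attempts");

        List<String> orphans = containersToKill(
                Map.of("app_1", List.of("container_1", "container_2")), Set.of());
        System.out.println("containers to kill: " + orphans);
    }
}
```

With RETRY_FOREVER the AM never exits on its own, which is the first bullet;
with a finite maxWaitMs it gives up, and the orphaned containers are then left
for the RM-side cleanup described in the second bullet.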
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
> Key: MAPREDUCE-4152
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.2
> Reporter: Thomas Graves
> Assignee: Thomas Graves
> Attachments: MAPREDUCE-4152.patch, MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour. The
> application master exited with "Could not contact RM after 360000
> milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
> job_1333003059741_15999Job Transitioned from RUNNING to ERROR