[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507718#comment-14507718
 ] 

Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------

bq. This is probably because, during the rolling upgrade, the NM was down for 
some time. So the Node_Removed event might have occurred because of either 
expiry or a reconnected event. The node removed event kills all the running 
containers, which happened before the container was pulled by the AM.

That doesn't add up, since the container was just allocated by the node 
heartbeating in.  Therefore I don't see how the RM could reasonably be expiring 
the node, nor should the node be unregistering.  Re-registration does _not_ 
kill containers on the node.  If it did then NM restart could not possibly 
work, since the NM re-registers when it starts up.

> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
>                 Key: MAPREDUCE-6329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>         Attachments: syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM,
> and then the job hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
