[jira] [Commented] (MAPREDUCE-6329) Failure of start map task on NM cause job hang

Jason Lowe (JIRA) Wed, 22 Apr 2015 06:32:59 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507067#comment-14507067
 ]


Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------

From a scan of the AM logs, this looks like a situation where the AM is 
waiting for the RM to allocate a new container while the RM thinks all asks 
are fulfilled.  We would need to look at the RM logs to verify.

I noticed this odd sequence in the AM log:
{noformat}
2015-04-20 21:36:37,225 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 2
[...]
2015-04-20 21:36:37,236 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1428390739155_23973_01_000002 to 
attempt_1428390739155_23973_m_000000_0
[...]
2015-04-20 21:36:37,246 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_1428390739155_23973_01_000003 to 
attempt_1428390739155_23973_m_000001_0
[... container 3 proceeds to fail to launch ...]
2015-04-20 21:36:38,259 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed 
container container_1428390739155_23973_01_000003
[...]
2015-04-20 21:36:39,276 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed 
container container_1428390739155_23973_01_000004
2015-04-20 21:36:39,276 ERROR [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete 
event for unknown container id container_1428390739155_23973_01_000004
{noformat}

I see the AM received two containers per the "Got allocated containers 2" log 
message, presumably containers 000002 and 000003.  Then the AM is suddenly 
notified of a completed container 000004 that apparently was never allocated 
to it.  I do not see a corresponding "Got allocated" message that would 
indicate the AM ever saw container 000004.  That may explain why the AM is 
stuck.  If the RM thinks it allocated a container to the AM and that container 
was then released, it will consider all asks satisfied.  However, the AM would 
need to re-ask for the final map container or the job will not progress.  We 
need to look into the RM log to find the RM's perspective of what happened to 
container_1428390739155_23973_01_000004.
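
For anyone repeating this check on a larger AM log, the manual scan described 
above could be approximated with a small script.  This is only a sketch: the 
function name find_unassigned_completions is made up for illustration, and the 
patterns assume the RMContainerAllocator message wording shown in the excerpt 
above ("Assigned container ... to ..." and "Received completed container ..."), 
which may differ in other Hadoop versions.

```python
import re

# Patterns assume the message wording quoted in the AM log excerpt above.
ASSIGNED = re.compile(r"Assigned container (container_\w+) to (attempt_\w+)")
COMPLETED = re.compile(r"Received completed container (container_\w+)")


def find_unassigned_completions(log_text):
    """Return ids of containers reported complete without a prior assignment.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    # Mail archives wrap long log lines, so collapse all whitespace first.
    text = " ".join(log_text.split())

    # Collect both kinds of events with their position in the log.
    events = []
    for m in ASSIGNED.finditer(text):
        events.append((m.start(), "assigned", m.group(1)))
    for m in COMPLETED.finditer(text):
        events.append((m.start(), "completed", m.group(1)))

    # Replay events in log order, flagging completions the AM never assigned.
    assigned = set()
    suspicious = []
    for _, kind, container_id in sorted(events):
        if kind == "assigned":
            assigned.add(container_id)
        elif container_id not in assigned and container_id not in suspicious:
            suspicious.append(container_id)
    return suspicious
```

Calling find_unassigned_completions on the text of the AM log returns the 
container ids worth chasing in the RM log.  A hit only shows that the AM log 
lacks an assignment for that container; it does not by itself prove what the 
RM did.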

> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
>                 Key: MAPREDUCE-6329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>         Attachments: syslog.tgz
>
>
> During a rolling update of the NM, the AM failed to start a container on the 
> NM, and the job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
