[ https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507067#comment-14507067 ]
Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------
Scanning the AM logs, it looks like this may be a situation where the AM is
waiting for the RM to allocate a new container but the RM thinks all asks are
fulfilled. We would need to look into the RM logs to try to verify.
I noticed this odd sequence in the AM log:
{noformat}
2015-04-20 21:36:37,225 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 2
[...]
2015-04-20 21:36:37,236 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000002 to attempt_1428390739155_23973_m_000000_0
[...]
2015-04-20 21:36:37,246 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1428390739155_23973_01_000003 to attempt_1428390739155_23973_m_000001_0
[... container 3 proceeds to fail to launch ...]
2015-04-20 21:36:38,259 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000003
[...]
2015-04-20 21:36:39,276 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1428390739155_23973_01_000004
2015-04-20 21:36:39,276 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1428390739155_23973_01_000004
{noformat}
From the "Got allocated containers 2" log message, I see the AM received two
containers, presumably 000002 and 000003. Then suddenly the AM is notified of
a released container 000004 that apparently was never allocated. I do not see
a corresponding "Got allocated" message that would indicate the AM ever saw
container 000004. That may explain why the AM is stuck. If the RM thought it
allocated a container to the AM and that container was released, then it will
think all asks are satisfied. However, the AM would need to re-ask for the
final map container or the job will not progress. We need to look into the RM
log and find the RM's perspective of what happened to
container_1428390739155_23973_01_000004.
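To check a whole AM log for this pattern rather than eyeballing it, something
like the following rough sketch could help. It only relies on the two
RMContainerAllocator message texts quoted above ("Assigned container" and
"Received completed container"); the function name and the command-line usage
are my own illustration, not anything that ships with Hadoop. Note that a
container can be allocated without ever being assigned to a task attempt, so
anything this reports is only a candidate to look up in the RM log.

```python
import re
import sys

# Message fragments taken from the RMContainerAllocator lines quoted above.
# \s+ is used between words so entries that were wrapped across lines still match.
ASSIGNED = re.compile(r"Assigned\s+container\s+(container_\w+)")
COMPLETED = re.compile(r"Received\s+completed\s+container\s+(container_\w+)")


def find_unassigned_completions(log_text):
    """Return IDs of containers reported completed but never assigned.

    Hypothetical helper for illustration. IDs are returned in the order
    their completion is first seen, without duplicates.
    """
    assigned = set(ASSIGNED.findall(log_text))
    unknown = []
    for container_id in COMPLETED.findall(log_text):
        if container_id not in assigned and container_id not in unknown:
            unknown.append(container_id)
    return unknown


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python check_am_log.py <path to AM syslog>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for container_id in find_unassigned_completions(log_file.read()):
            print(container_id)
```

Any container ID it prints is one to search for in the RM log to see whether
the RM believes it allocated that container to this application attempt.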
> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
> Key: MAPREDUCE-6329
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Attachments: syslog.tgz
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM,
> and the job then hung.
> AM logs are attached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)