[jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time

Varun Saxena (JIRA) Fri, 16 Oct 2015 04:11:02 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960518#comment-14960518
 ]


Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

One more thing I noticed is that in 
RMContainerAllocator#preemptReducesIfNeeded, we simply clear the scheduled 
reduces map and move these reducers to pending. The ask is not updated to 
reflect this. So the RM keeps on allocating containers for reduces, and the AM 
cannot assign them because no reducer is scheduled (check the logs below the 
code). This eventually leads to these reducers not being assigned, but why are 
we not updating the ask immediately?

{code}
        LOG.info("Ramping down all scheduled reduces:"
            + scheduledRequests.reduces.size());
        for (ContainerRequest req : scheduledRequests.reduces.values()) {
          pendingReduces.add(req);
        }
        scheduledRequests.reduces.clear();
{code}

{noformat}
2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container not 
assigned : container_1437451211867_1485_01_000215
2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Cannot assign 
container Container: [ContainerId: container_1437451211867_1485_01_000216, 
NodeId: hdszzdcxdat6g06u04p:26009, NodeHttpAddress: hdszzdcxdat6g06u04p:26010, 
Resource: <memory:4096, vCores:1>, Priority: 10, Token: Token { kind: 
ContainerToken, service: 10.2.33.236:26009 }, ] for a reduce as either  
container memory less than required 4096 or no pending reduce tasks - 
reduces.isEmpty=true
2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container not 
assigned : container_1437451211867_1485_01_000216
2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Cannot assign 
container Container: [ContainerId: container_1437451211867_1485_01_000217, 
NodeId: hdszzdcxdat6g06u06p:26009, NodeHttpAddress: hdszzdcxdat6g06u06p:26010, 
Resource: <memory:4096, vCores:1>, Priority: 10, Token: Token { kind: 
ContainerToken, service: 10.2.33.239:26009 }, ] for a reduce as either  
container memory less than required 4096 or no pending reduce tasks - 
reduces.isEmpty=true
{noformat}
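
To make the mismatch concrete, here is a minimal standalone sketch. It is a toy model, not the actual RMContainerAllocator code: the class RampDownSketch, its fields, and its methods are all hypothetical, and "reduceAsk" is just a counter standing in for the reduce container requests still outstanding with the RM. It only illustrates why containers keep arriving that cannot be assigned when the scheduled map is cleared without touching the ask.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only; every name here is hypothetical, none of it is Hadoop API. */
class RampDownSketch {
    // Reduce attempts that currently have a container request "scheduled".
    final Map<String, String> scheduledReduces = new LinkedHashMap<>();
    // Reduce attempts parked until they are ramped up again.
    final Deque<String> pendingReduces = new ArrayDeque<>();
    // Stand-in for the ask: reduce containers still requested from the RM.
    int reduceAsk = 0;

    void scheduleReduce(String attemptId) {
        scheduledReduces.put(attemptId, "request-for-" + attemptId);
        reduceAsk++;
    }

    /** Same shape as the quoted snippet: move to pending, leave the ask alone. */
    void rampDownWithoutAskUpdate() {
        pendingReduces.addAll(scheduledReduces.keySet());
        scheduledReduces.clear();
    }

    /** Assumed alternative: shrink the ask while moving each reduce to pending. */
    void rampDownWithAskUpdate() {
        for (String attemptId : scheduledReduces.keySet()) {
            pendingReduces.add(attemptId);
            reduceAsk--;
        }
        scheduledReduces.clear();
    }

    /** Containers the RM may still hand out that no scheduled reduce can take. */
    int unassignableContainers() {
        return Math.max(0, reduceAsk - scheduledReduces.size());
    }

    public static void main(String[] args) {
        RampDownSketch stale = new RampDownSketch();
        stale.scheduleReduce("r_000000");
        stale.scheduleReduce("r_000001");
        stale.rampDownWithoutAskUpdate();
        System.out.println("without ask update: " + stale.unassignableContainers());

        RampDownSketch updated = new RampDownSketch();
        updated.scheduleReduce("r_000000");
        updated.scheduleReduce("r_000001");
        updated.rampDownWithAskUpdate();
        System.out.println("with ask update: " + updated.unassignableContainers());
    }
}
```

In this model the first case leaves two requests outstanding with nothing scheduled to receive them, which corresponds to the "Cannot assign container ... reduces.isEmpty=true" lines in the logs above. How the real ask should be adjusted in RMContainerAllocator is a separate question that this sketch does not answer.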

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> While a job with a large number of tasks was in progress, one node became 
> unstable due to some OS issue. After the node became unstable, the maps on 
> this node changed to KILLED state. 
> The maps which were running on the unstable node are rescheduled; all of them 
> are in scheduled state, waiting for the RM to assign containers. Ask requests 
> for maps were seen until the node was good again (all of those failed); there 
> are no ask requests after this. But the AM keeps on preempting the reducers 
> (it keeps recycling them).
> Finally, the reducers are waiting for the mappers to complete, and the 
> mappers did not get containers.
> My question is:
> ============
> Why did the AM not send map requests once the node had recovered?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
