[
https://issues.apache.org/jira/browse/MAPREDUCE-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734155#comment-14734155
]
Karthik Kambatla commented on MAPREDUCE-6470:
---------------------------------------------
Interesting issue. To summarize for others: the headroom is an aggregate across
all nodes in the cluster, and provides no information about the largest
container that can be allocated on the cluster.
MAPREDUCE-6302 should address this issue reactively for MapReduce. However,
there will be a delay, which we should try to avoid. Let me create a YARN JIRA
to discuss/track the inclusion of a new API.
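To illustrate the gap, a rough sketch of the AM-side view (not the actual
RMContainerAllocator code; the class and method names here are made up, only
the AllocateResponse/Resources calls are real):
{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  // The headroom is a single aggregate Resource for the whole cluster,
  // e.g. <6144 MB, 6 vcores> in the scenario below.
  static boolean mapAppearsToFit(AllocateResponse response) {
    Resource headroom = response.getAvailableResources();
    Resource mapAsk = Resource.newInstance(1024, 1); // one map task
    // Passes against the aggregate headroom, so the AM sees no need to
    // preempt a reducer, even though no single node has room for the task.
    return Resources.fitsIn(mapAsk, headroom);
  }
}
{code}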
> ApplicationMaster may fail to preempt Reduce task
> -------------------------------------------------
>
> Key: MAPREDUCE-6470
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6470
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, resourcemanager, scheduler
> Affects Versions: 2.7.1
> Reporter: NING DING
>
> In my hadoop cluster the nodemanagers have different resource capacities.
> Recently, when the yarn cluster ran out of resources while some big jobs were
> running, the AM could not preempt reduce tasks.
> The scenario can be simplified as below:
> Say there are 5 nodemanagers in my hadoop cluster with the FairScheduler
> enabled.
> NodeManager Capacity:
> namenode1 <1024 memory, 1 cpu-vcores>
> namenode2 <4096 memory, 1 cpu-vcores>
> namenode3 <4096 memory, 1 cpu-vcores>
> namenode4 <1024 memory, 4 cpu-vcores>
> namenode5 <1024 memory, 4 cpu-vcores>
> Start one job with 10 maps and 10 reduces and the following conf:
> yarn.app.mapreduce.am.resource.mb=1024
> yarn.app.mapreduce.am.resource.cpu-vcores=1
> mapreduce.map.memory.mb=1024
> mapreduce.reduce.memory.mb=1024
> mapreduce.map.cpu.vcores=1
> mapreduce.reduce.cpu.vcores=1
> After some map tasks finished, 4 reduce tasks started, but there were still
> some map tasks in scheduledRequests.
> At this time, the resource usage of the 5 nodemanagers was as below.
> NodeManager, Memory Used, Vcores Used, Memory Avail, Vcores Avail
> namenode1, 1024m, 1, 0, 0
> namenode2, 1024m, 1, 3072m, 0
> namenode3, 1024m, 1, 3072m, 0
> namenode4, 1024m, 1, 0, 3
> namenode5, 1024m, 1, 0, 3
> So the AM tries to start the remaining map tasks.
> In RMContainerAllocator, the availableResources obtained from
> ApplicationMasterService is <6144m, 6 cpu-vcores>.
> RMContainerAllocator therefore thinks there is enough resource to start one
> map task, so it will not try to preempt a reduce task. But in fact no single
> nodemanager has enough resources available to run one map task, so the AM
> fails to obtain containers for the remaining map tasks. And since the reduce
> tasks are never preempted, their resources are never released and the job
> hangs forever.
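> To make the mismatch concrete, here is a rough illustration (not the real
> allocator code; the figures are just the ones from the table above):
> {code:java}
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.util.resource.Resources;
>
> public class HeadroomMismatch {
>   public static void main(String[] args) {
>     Resource mapAsk = Resource.newInstance(1024, 1);
>
>     // Aggregate headroom reported to the AM: 3072+3072 = 6144 MB, 3+3 = 6 vcores.
>     Resource headroom = Resource.newInstance(6144, 6);
>     System.out.println(Resources.fitsIn(mapAsk, headroom));  // true -> no preemption
>
>     // Per-node available resources from the table above.
>     Resource[] perNode = {
>         Resource.newInstance(0, 0),     // namenode1
>         Resource.newInstance(3072, 0),  // namenode2
>         Resource.newInstance(3072, 0),  // namenode3
>         Resource.newInstance(0, 3),     // namenode4
>         Resource.newInstance(0, 3)      // namenode5
>     };
>     for (Resource avail : perNode) {
>       System.out.println(Resources.fitsIn(mapAsk, avail));   // false on every node
>     }
>   }
> }
> {code}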
> I think the problem is that the overall resource headroom is not enough to
> help the AM make the right decision on whether to preempt a reduce task.
> We need to provide more information to the AM, e.g. add a new API in
> AllocateResponse to get the available resource list of all nodemanagers. But
> this approach might cost too much overhead.
> Any ideas?
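> For discussion, one hypothetical shape such an API could take (nothing like
> this exists in YARN today; the interface and method names below are made up):
> {code:java}
> import java.util.List;
> import org.apache.hadoop.yarn.api.records.NodeReport;
> import org.apache.hadoop.yarn.api.records.Resource;
>
> // Hypothetical information the AM could get alongside the headroom.
> public interface NodeHeadroomInfo {
>   // Largest amount of resources currently available on any single node;
>   // enough for the AM to tell whether a pending request can fit anywhere.
>   Resource getMaximumSingleNodeHeadroom();
>
>   // Or, at higher cost, the per-node available-resource list.
>   List<NodeReport> getNodeHeadroomReports();
> }
> {code}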
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)