[
https://issues.apache.org/jira/browse/MAPREDUCE-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734155#comment-14734155
]
Karthik Kambatla commented on MAPREDUCE-6470:
---------------------------------------------
Interesting issue. To summarize for others: the headroom is an aggregate across
all nodes in the cluster, and provides no information about the largest
container that can be allocated on the cluster.
MAPREDUCE-6302 should address this issue reactively for MapReduce. However,
there will be a delay, which we should try to avoid. Let me create a YARN JIRA
to discuss/track the inclusion of a new API.
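To illustrate the gap, a rough sketch of the AM-side view (not the actual
RMContainerAllocator code; the class and method names here are made up, only
the AllocateResponse/Resources calls are real):
{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  // The headroom is a single aggregate Resource for the whole cluster,
  // e.g. <6144 MB, 6 vcores> in the scenario below.
  static boolean mapAppearsToFit(AllocateResponse response) {
    Resource headroom = response.getAvailableResources();
    Resource mapAsk = Resource.newInstance(1024, 1); // one map task
    // Passes against the aggregate headroom, so the AM sees no need to
    // preempt a reducer, even though no single node has room for the task.
    return Resources.fitsIn(mapAsk, headroom);
  }
}
{code}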
> ApplicationMaster may fail to preempt Reduce task
> -------------------------------------------------
>
> Key: MAPREDUCE-6470
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6470
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster, resourcemanager, scheduler
> Affects Versions: 2.7.1
> Reporter: NING DING
>
> In my hadoop cluster the nodemanagers have different resource capacities.
> Recently, when the yarn cluster ran out of resources while some big jobs were
> running, the AM could not preempt reduce tasks.
> The scenario can be simplified as below:
> Say there are 5 nodemanagers in my hadoop cluster with the FairScheduler
> enabled.
> NodeManager Capacity:
> namenode1 <1024 memory, 1 cpu-vcores>
> namenode2 <4096 memory, 1 cpu-vcores>
> namenode3 <4096 memory, 1 cpu-vcores>
> namenode4 <1024 memory, 4 cpu-vcores>
> namenode5 <1024 memory, 4 cpu-vcores>
> Start one job with 10 maps and 10 reduces and the following conf:
> yarn.app.mapreduce.am.resource.mb=1024
> yarn.app.mapreduce.am.resource.cpu-vcores=1
> mapreduce.map.memory.mb=1024
> mapreduce.reduce.memory.mb=1024
> mapreduce.map.cpu.vcores=1
> mapreduce.reduce.cpu.vcores=1
> After some map tasks finished, 4 reduce tasks started, but there were still
> some map tasks in scheduledRequests.
> At this time, the resource usage of the 5 nodemanagers was as below.
> NodeManager, Memory Used, Vcores Used, Memory Avail, Vcores Avail
> namenode1, 1024m, 1, 0, 0
> namenode2, 1024m, 1, 3072m, 0
> namenode3, 1024m, 1, 3072m, 0
> namenode4, 1024m, 1, 0, 3
> namenode5, 1024m, 1, 0, 3
> So the AM tries to start the remaining map tasks.
> In RMContainerAllocator, the availableResources obtained from
> ApplicationMasterService is <6144m, 6 cpu-vcores>.
> RMContainerAllocator therefore thinks there is enough resource to start one
> map task, so it will not try to preempt a reduce task. But in fact no single
> nodemanager has enough resources available to run one map task, so the AM
> fails to obtain containers for the remaining map tasks. And since the reduce
> tasks are never preempted, their resources are never released and the job
> hangs forever.
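> To make the mismatch concrete, here is a rough illustration (not the real
> allocator code; the figures are just the ones from the table above):
> {code:java}
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.util.resource.Resources;
>
> public class HeadroomMismatch {
>   public static void main(String[] args) {
>     Resource mapAsk = Resource.newInstance(1024, 1);
>
>     // Aggregate headroom reported to the AM: 3072+3072 = 6144 MB, 3+3 = 6 vcores.
>     Resource headroom = Resource.newInstance(6144, 6);
>     System.out.println(Resources.fitsIn(mapAsk, headroom));  // true -> no preemption
>
>     // Per-node available resources from the table above.
>     Resource[] perNode = {
>         Resource.newInstance(0, 0),     // namenode1
>         Resource.newInstance(3072, 0),  // namenode2
>         Resource.newInstance(3072, 0),  // namenode3
>         Resource.newInstance(0, 3),     // namenode4
>         Resource.newInstance(0, 3)      // namenode5
>     };
>     for (Resource avail : perNode) {
>       System.out.println(Resources.fitsIn(mapAsk, avail));   // false on every node
>     }
>   }
> }
> {code}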
> I think the problem is that the overall resource headroom is not enough to
> help the AM make the right decision on whether to preempt a reduce task.
> We need to provide more information to the AM, e.g. add a new API in
> AllocateResponse to get the available resource list of all nodemanagers. But
> this approach might cost too much overhead.
> Any ideas?
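> For discussion, one hypothetical shape such an API could take (nothing like
> this exists in YARN today; the interface and method names below are made up):
> {code:java}
> import java.util.List;
> import org.apache.hadoop.yarn.api.records.NodeReport;
> import org.apache.hadoop.yarn.api.records.Resource;
>
> // Hypothetical information the AM could get alongside the headroom.
> public interface NodeHeadroomInfo {
>   // Largest amount of resources currently available on any single node;
>   // enough for the AM to tell whether a pending request can fit anywhere.
>   Resource getMaximumSingleNodeHeadroom();
>
>   // Or, at higher cost, the per-node available-resource list.
>   List<NodeReport> getNodeHeadroomReports();
> }
> {code}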
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)