[ https://issues.apache.org/jira/browse/MAPREDUCE-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603245#comment-13603245 ]
Vitaly Kruglikov commented on MAPREDUCE-5068:
---------------------------------------------
Per Sandy's recommendation, I opened the CDH JIRA
https://issues.cloudera.org/browse/DISTRO-466
> Fair Scheduler preemption fails if the other queue has a mapreduce job with
> some tasks in excess of cluster capacity
> --------------------------------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5068
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5068
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, scheduler
> Environment: Mac OS X; CDH4.1.2; CDH4.2.0
> Reporter: Vitaly Kruglikov
> Labels: hadoop
>
> This is reliably reproduced while running CDH4.1.2 on a single Mac OS X
> machine.
> # Two queues are configured: cjmQ and slotsQ, both with tiny minResources. The
> intention is for the task(s) of the cjmQ job to be able to preempt tasks of the
> slotsQ job.
> # yarn.nodemanager.resource.memory-mb = 24576
> # First, a long-running 6-map-task (0 reducers) mapreduce job is started in
> slotsQ with mapreduce.map.memory.mb=4096. Because the MRAppMaster's container
> consumes some memory, only 5 of its 6 map tasks are able to start; the 6th
> stays pending and never runs (see the rough capacity math after this list).
> # Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted
> via cjmQ with mapreduce.map.memory.mb=2048.
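> Rough capacity math behind step 2 (the ~2048 MB MRAppMaster container size is an
> assumption on my part, based on the default AM request being rounded up to the
> scheduler's allocation increment; I did not capture the exact figure):
> {code}
> NodeManager capacity (yarn.nodemanager.resource.memory-mb):  24576 MB
> MRAppMaster container (assumed):                            ~ 2048 MB
> 5 map containers x 4096 MB:                                  20480 MB
> Total allocated:                                            ~22528 MB
> Headroom left:                                              ~ 2048 MB  (< 4096 MB, so the 6th map stays pending)
> {code}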
> Expected behavior:
> At this point, because the minimum share of cjmQ had not been met, I expected
> Fair Scheduler to preempt one of the executing map tasks of the single
> slotsQ mapreduce job to make room for the single map task of the cjmQ
> mapreduce job. However, Fair Scheduler did not preempt any of the running map
> tasks of the slotsQ job; instead, the cjmQ job was starved indefinitely.
> Since slotsQ had far more than its minimum share already allocated and
> running, while cjmQ was far below its minimum share (at 0, in fact), Fair
> Scheduler should have started preempting regardless of the one slotsQ task
> container (the 6th map container) that had not been allocated.
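> To make that expectation concrete, here is a minimal, self-contained sketch of
> the min-share preemption rule as I understand it from the Fair Scheduler
> documentation; the class and method names are mine, and this is only an
> illustration, not the actual FairScheduler code:
> {code}
> // Illustrative only; not the real FairScheduler implementation.
> public class MinSharePreemptionSketch {
>
>   /**
>    * A queue that has been below its min share for longer than its
>    * minSharePreemptionTimeout is expected to be allowed to preempt
>    * containers from queues running above their share.
>    */
>   static boolean shouldPreemptForMinShare(long usageMb,
>                                           long minShareMb,
>                                           long belowMinShareSinceMs,
>                                           long minSharePreemptionTimeoutMs,
>                                           long nowMs) {
>     return usageMb < minShareMb
>         && (nowMs - belowMinShareSinceMs) > minSharePreemptionTimeoutMs;
>   }
>
>   public static void main(String[] args) {
>     long now = System.currentTimeMillis();
>     // Numbers from this scenario: cjmQ has minResources=2048, zero usage,
>     // and a 5-second minSharePreemptionTimeout; assume it has been starved
>     // for ~10 seconds.
>     boolean preempt = shouldPreemptForMinShare(0, 2048, now - 10000, 5000, now);
>     System.out.println("Expected preemption decision for cjmQ: " + preempt); // true
>   }
> }
> {code}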
> Additional useful info:
> # If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ
> mapreduce job gets scheduled and its state changes to RUNNING; once that first
> job completes, the second job submitted via cjmQ is starved until a third job
> is submitted into cjmQ, and so on. This happens regardless of the values of
> maxRunningApps in the queue configurations.
> # If, instead of requesting 6 map tasks for the slotsQ job, I request only 5,
> so that everything fits within yarn.nodemanager.resource.memory-mb and there is
> no 6th pending-but-never-running task, then preemption works as I would have
> expected. However, I cannot rely on this arrangement: in a production cluster
> running at full capacity, if a machine dies, the slotsQ mapreduce job will
> request new containers for the failed tasks, and because the cluster was
> already at capacity, those containers will end up pending and never run,
> recreating my original scenario of the starving cjmQ job.
> # I initially wrote this up on
> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM,
> so it would be good to update that group with the resolution.
> Configuration:
> In yarn-site.xml:
> {code}
> <property>
>   <description>Scheduler plug-in class to use instead of the default
>     scheduler.</description>
>   <name>yarn.resourcemanager.scheduler.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
> </property>
> {code}
> fair-scheduler.xml:
> {code}
> <configuration>
>   <!-- Site specific FairScheduler configuration properties -->
>
>   <property>
>     <description>Absolute path to allocation file. An allocation file is an XML
>       manifest describing queues and their properties, in addition to certain
>       policy defaults. This file must be in XML format as described in
>       http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
>     </description>
>     <name>yarn.scheduler.fair.allocation.file</name>
>     <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
>   </property>
>
>   <property>
>     <description>Whether to use preemption. Note that preemption is experimental
>       in the current version. Defaults to false.</description>
>     <name>yarn.scheduler.fair.preemption</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <description>Whether to allow multiple container assignments in one
>       heartbeat. Defaults to false.</description>
>     <name>yarn.scheduler.fair.assignmultiple</name>
>     <value>true</value>
>   </property>
> </configuration>
> {code}
> My fair-scheduler-allocations.xml:
> {code}
> <allocations>
>   <queue name="cjmQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>2048</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- either "fifo" or "fair" depending on the in-queue scheduling policy desired -->
>     <schedulingMode>fifo</schedulingMode>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>
>   <queue name="slotsQ">
>     <!-- minimum amount of aggregate memory; TODO which units??? -->
>     <minResources>1</minResources>
>     <!-- limit the number of apps from the queue to run at once -->
>     <maxRunningApps>1</maxRunningApps>
>     <!-- Number of seconds after which the pool can preempt other pools'
>       tasks to achieve its min share. Requires preemption to be enabled in
>       mapred-site.xml by setting mapred.fairscheduler.preemption to true.
>       Defaults to infinity (no preemption). -->
>     <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
>     <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
>     <weight>1.0</weight>
>   </queue>
>
>   <!-- number of seconds a queue is under its fair share before it will try to
>     preempt containers to take resources from other queues. -->
>   <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
> </allocations>
> {code}
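> Putting the configured numbers together (with the same assumed ~2048 MB AM
> container as in the capacity math above), the preemption I expected looks
> roughly like this:
> {code}
> cjmQ:   minResources = 2048 MB, usage =     0 MB  -> below min share
> slotsQ: minResources =    1 MB, usage = 22528 MB  -> far above min share
> minSharePreemptionTimeout = 5 s
>
> After ~5 s of cjmQ sitting at 0, one 4096 MB slotsQ map container should be
> preempted, which is more than enough room for cjmQ's single 2048 MB map.
> {code}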