Vitaly Kruglikov created MAPREDUCE-5068:
-------------------------------------------

             Summary: Fair Scheduler preemption fails if the other queue has a 
mapreduce job with some tasks in excess of cluster capacity
                 Key: MAPREDUCE-5068
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5068
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2, scheduler
         Environment: Mac OS X; CDH4.1.1
            Reporter: Vitaly Kruglikov


This is reliably reproduced while running CDH4.1 on a single Mac OS X machine.

# Two queues are configured: cjmQ and slotsQ. Both queues are configured with 
tiny minResources. The intention is for the task(s) of the job in cjmQ to be 
able to preempt tasks of the job in slotsQ.
# yarn.nodemanager.resource.memory-mb = 24576
# First, a long-running 6-map-task (0 reducers) mapreduce job is started in 
slotsQ with mapreduce.map.memory.mb=4096. Because MRAppMaster's container 
consumes some memory, only 5 of its 6 map tasks are able to start, and the 6th 
is pending, but will never run.
# Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via 
cjmQ with mapreduce.map.memory.mb=2048 (see the configuration sketch right 
after this list).
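
For reference, the memory settings above correspond roughly to the following 
configuration fragments. This is an illustrative sketch only: the values come 
from the steps above, but the per-job properties (including 
mapreduce.job.queuename for routing each job to its queue) are shown here for 
clarity rather than copied verbatim from the actual job submissions.
{code}
<!-- yarn-site.xml: total memory available to containers on the NodeManager -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>

<!-- per-job settings for the long-running slotsQ job (6 maps of 4096 MB each);
     mapreduce.job.queuename is assumed here as the way the job is routed to slotsQ -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.job.queuename</name>
  <value>slotsQ</value>
</property>

<!-- per-job settings for the short-running cjmQ job (1 map of 2048 MB); same assumption -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.job.queuename</name>
  <value>cjmQ</value>
</property>
{code}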

Expected behavior:
At this point, because the minimum share of cjmQ had not been met, I expected 
Fair Scheduler to preempt one of the running map tasks of the single slotsQ 
mapreduce job to make room for the single map task of the cjmQ mapreduce job. 
However, Fair Scheduler did not preempt any of the running map tasks of the 
slotsQ job. Instead, the cjmQ job was starved perpetually.
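
To spell out the capacity arithmetic behind this expectation (the exact 
MRAppMaster container size is not stated above, so it appears below simply as 
AM):
{code}
node capacity:       yarn.nodemanager.resource.memory-mb = 24576 MB
slotsQ maps asked:   6 x 4096 MB                         = 24576 MB  (the entire node)
slotsQ MRAppMaster:  AM MB                                           (some memory)
slotsQ maps running: 5 x 4096 MB                         = 20480 MB
headroom left:       24576 - 20480 - AM                  <  4096 MB
cjmQ still needs:    its own AM container + one 2048 MB map task
{code}
With the node that full, the only way I expected cjmQ to reach its minResources 
of 2048 was for Fair Scheduler to preempt one of the running 4096 MB slotsQ 
maps.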

Additional useful info:
# If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ 
mapreduce job in that queue gets scheduled and its state changes to RUNNING; 
once that first job completes, the second job submitted via cjmQ gets starved 
until a third job is submitted into cjmQ, and so on. This happens regardless 
of the values of maxRunningApps in the queue configurations.
# If, instead of requesting 6 map tasks for the slotsQ job, I request only 5, 
so that everything fits into yarn.nodemanager.resource.memory-mb without that 
6th pending-but-never-running task, then preemption works as I would have 
expected. However, I cannot rely on this arrangement: in a production cluster 
running at full capacity, if a machine dies, the slotsQ mapreduce job will 
request new containers for the failed tasks, and because the cluster is 
already at capacity those containers will end up pending and never run, 
recreating my original scenario.
# I initially wrote this up on 
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM,
 so it would be good to update that group with the resolution.

Configuration:

In yarn-site.xml:
{code}
  <property>
    <description>Scheduler plug-in class to use instead of the default 
scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
{code}

fair-scheduler.xml:
{code}
<configuration>

<!-- Site specific FairScheduler configuration properties -->
  <property>
    <description>Absolute path to allocation file. An allocation file is an XML
    manifest describing queues and their properties, in addition to certain
    policy defaults. This file must be in XML format as described in
    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
    TODO: fix up the hard-coded path in prepare_conf.py
    </description>
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
  </property>

  <property>
    <description>Whether to use preemption. Note that preemption is experimental
    in the current version. Defaults to false.</description>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
  </property>

  <property>
    <description>Whether to allow multiple container assignments in one
    heartbeat. Defaults to false.</description>
    <name>yarn.scheduler.fair.assignmultiple</name>
    <value>true</value>
  </property>
  
</configuration>
{code}

My fair-scheduler-allocations.xml:
{code}
<allocations>

  <queue name="cjmQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>2048</minResources>

    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>
    
    <!-- either "fifo" or "fair" depending on the in-queue scheduling policy
    desired -->
    <schedulingMode>fifo</schedulingMode>

    <!-- Number of seconds after which the pool can preempt other pools'
      tasks to achieve its min share. Requires preemption to be enabled in
      mapred-site.xml by setting mapred.fairscheduler.preemption to true.
      Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>

    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>

  <queue name="slotsQ">
    <!-- minimum amount of aggregate memory; TODO which units??? -->
    <minResources>1</minResources>

    <!-- limit the number of apps from the queue to run at once -->
    <maxRunningApps>1</maxRunningApps>

    <!-- Number of seconds after which the pool can preempt other pools'
      tasks to achieve its min share. Requires preemption to be enabled in
      mapred-site.xml by setting mapred.fairscheduler.preemption to true.
      Defaults to infinity (no preemption). -->
    <minSharePreemptionTimeout>5</minSharePreemptionTimeout>

    <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
    <weight>1.0</weight>
  </queue>
  
  <!-- number of seconds a queue is under its fair share before it will try to
  preempt containers to take resources from other queues. -->
  <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>

</allocations>

{code}
