A fair sharing job scheduler
----------------------------

                 Key: HADOOP-3746
                 URL: https://issues.apache.org/jira/browse/HADOOP-3746
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
            Reporter: Matei Zaharia
            Assignee: Matei Zaharia
            Priority: Minor
             Fix For: 0.19.0


The default job scheduler in Hadoop has a first-in-first-out queue of jobs for 
each priority level. The scheduler always assigns task slots to the first job 
in the highest-level priority queue that is in need of tasks. This makes it 
difficult to share a MapReduce cluster between users because a large job will 
starve subsequent jobs in its queue, but at the same time, giving lower 
priorities to large jobs would cause them to be starved by a stream of 
higher-priority jobs. Today, one solution to this problem is to create separate 
MapReduce clusters for different user groups with Hadoop On-Demand, but this 
hurts system utilization because a group's cluster may be mostly idle for long 
periods of time. HADOOP-3445 also addresses this problem by sharing a cluster 
between different queues, but still provides only FIFO scheduling within a 
queue.

This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits 
up compute time proportionally between jobs that have been submitted, emulating 
an "ideal" scheduler that gives each job 1/Nth of the available capacity. When 
there is a single job running, that job receives all the capacity. When other 
jobs are submitted, task slots that free up are assigned to the new jobs, so 
that everyone gets roughly the same amount of compute time. This lets short 
jobs finish in reasonable amounts of time while not starving long jobs. This is 
the type of scheduling used or emulated by operating systems - e.g. the 
Completely Fair Scheduler in Linux. Fair sharing can also work with job 
priorities - the priorities are used as weights to determine the fraction of 
total compute time that a job should get. 
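
To make the weighted sharing concrete, here is a minimal Java sketch of how 
priorities, treated as weights, translate into per-job slot shares. The class 
and method names are mine, not taken from the patch:

{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: each job's fair share is the fraction of
// total slots given by its weight over the sum of all weights.
public class FairShareSketch {
  static class JobInfo {
    final String name;
    final double weight;  // derived from the job's priority
    JobInfo(String name, double weight) { this.name = name; this.weight = weight; }
  }

  static void printFairShares(List<JobInfo> jobs, int totalSlots) {
    double totalWeight = 0;
    for (JobInfo j : jobs) totalWeight += j.weight;
    for (JobInfo j : jobs) {
      double share = totalSlots * j.weight / totalWeight;
      System.out.printf("%s: %.1f of %d slots%n", j.name, share, totalSlots);
    }
  }

  public static void main(String[] args) {
    // A weight-2 job gets twice the share of a weight-1 job:
    // here 66.7 and 33.3 of 100 slots.
    printFairShares(Arrays.asList(new JobInfo("adhoc-query", 2.0),
                                  new JobInfo("nightly-etl", 1.0)), 100);
  }
}
{code}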

In addition, the scheduler will support a way to guarantee capacity for 
particular jobs or user groups. A job can be marked as belonging to a "pool" 
using a parameter in the jobconf. An "allocations" file on the JobTracker can 
assign a minimum allocation to each pool: a number of map slots and reduce 
slots that the pool is guaranteed whenever it contains jobs. The scheduler 
will enforce these minimum allocations, but it will use fair sharing to 
assign any excess capacity, as well as the capacity within each pool. This 
lets an 
organization divide a cluster between groups similarly to the job queues in 
HADOOP-3445.
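
The issue leaves the allocations file format and the jobconf parameter name 
unspecified, so the following is purely illustrative: a plain-text file 
mapping each pool to its guaranteed map and reduce slots, and a hypothetical 
jobconf key selecting a pool.

{code}
# Hypothetical allocations file on the JobTracker (format assumed, not
# taken from the patch): pool name, min map slots, min reduce slots.
research    20   10
production  50   30
{code}

{code:java}
import org.apache.hadoop.mapred.JobConf;

// Hypothetical: tagging a job with a pool via its jobconf. The actual
// parameter name in the patch may differ; MyJob is a placeholder class.
JobConf conf = new JobConf(MyJob.class);
conf.set("pool", "research");
{code}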

*Implementation Status:*

I've implemented this scheduler using a version of the pluggable scheduler API 
in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair 
sharing, pools, priorities for weighting job shares, and a text-based 
allocation config file that is reloaded at runtime whenever it changes, so 
allocations can be updated without restarting the cluster. I will also 
create a patch for trunk that works with the latest interface in the patch 
submitted for HADOOP-3412.
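
As a rough sketch of that reload behavior (the patch's actual logic may 
differ), the scheduler can compare the file's modification time on each pass 
of its update loop:

{code:java}
import java.io.File;

// Illustrative only: reload the allocations file when its mtime changes.
class AllocationLoader {
  private final File allocFile;
  private long lastSeenMtime = 0;

  AllocationLoader(File allocFile) {
    this.allocFile = allocFile;
  }

  // Called periodically from the scheduler's update loop.
  void reloadIfChanged() {
    long mtime = allocFile.lastModified();
    if (mtime > lastSeenMtime) {
      lastSeenMtime = mtime;
      parseAllocations();  // re-read the pool -> minimum slot mappings
    }
  }

  private void parseAllocations() {
    // Parse pool minimums from allocFile (format-dependent; omitted here).
  }
}
{code}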

The actual implementation is simple. To implement fair sharing, the scheduler 
keeps track of a "deficit" for each job - the difference between the amount of 
compute time it should have gotten on an ideal scheduler, and the amount of 
compute time it actually got. This is a measure of how "unfair" we've been to 
the job. Every few hundred milliseconds, the scheduler updates the deficit of 
each job by looking at how many tasks each job had running during this interval 
vs. how many it should have had given its weight and the set of jobs that were 
running in this period. Whenever a task slot becomes available, it is assigned 
to the job with the highest deficit - unless one or more jobs are not meeting 
their pool capacity guarantees, in which case we choose among those "needy" 
jobs, again based on their deficit.
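
A compressed sketch of that logic, with illustrative names not copied from 
the patch: the deficit accumulates the gap between deserved and received 
slot-time, and a freed slot goes to the neediest job, with pool guarantees 
taking precedence:

{code:java}
import java.util.List;

// Illustrative sketch of deficit-based scheduling as described above.
class SchedulableJob {
  double deficit;        // deserved slot-time minus received slot-time
  double fairShare;      // slots this job deserves under weighted fair sharing
  int runningTasks;      // slots the job actually held during the interval
  boolean belowPoolMin;  // true if the job's pool is under its guaranteed slots
}

class DeficitScheduler {
  // Run every few hundred milliseconds.
  void updateDeficits(List<SchedulableJob> jobs, long intervalMs) {
    for (SchedulableJob j : jobs) {
      // Deficit grows by (deserved - received) slot-time over the interval.
      j.deficit += (j.fairShare - j.runningTasks) * intervalMs;
    }
  }

  // Run whenever a task slot frees up: pick the neediest job.
  SchedulableJob assignSlot(List<SchedulableJob> jobs) {
    SchedulableJob best = null;
    boolean bestNeedy = false;
    for (SchedulableJob j : jobs) {
      boolean needy = j.belowPoolMin;
      if (best == null
          || (needy && !bestNeedy)  // jobs under pool guarantees come first
          || (needy == bestNeedy && j.deficit > best.deficit)) {
        best = j;
        bestNeedy = needy;
      }
    }
    return best;
  }
}
{code}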

*Extensions:*

Once we keep track of pools, weights and deficits, we can do a lot of 
interesting things with a fair scheduler. One feature I will probably add is an 
option to give brand new jobs a priority boost until they have run for, say, 10 
minutes, to reduce response times even further for short jobs such as ad-hoc 
queries, while still being fair to longer-running jobs. It would also be easy 
to add a "maximum number of tasks" cap for each job as in HADOOP-2573 (although 
with priorities and pools, this JIRA reduces the need for such a cap - you can 
put a job in its own pool to give it a minimum share, and set its priority to 
VERY_LOW so it never takes excess capacity if there are other jobs in the 
cluster). Finally, I may implement "hierarchical pools" - the ability for a 
group to create pools within its pool, so that it can guarantee minimum 
allocations to various types of jobs while still ensuring that, taken 
together, its jobs receive at least the capacity of its full pool.
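
For the new-job boost, one simple shape is to scale a job's weight during its 
first minutes of life. The 10-minute window is the example above; the boost 
factor is an assumed placeholder:

{code:java}
// Illustrative: boost a young job's weight (and hence its fair share).
static double weightWithNewJobBoost(double baseWeight,
                                    long startTimeMs, long nowMs) {
  final long BOOST_WINDOW_MS = 10 * 60 * 1000L;  // 10 minutes, per the text
  final double BOOST_FACTOR = 3.0;               // assumed multiplier
  return (nowMs - startTimeMs < BOOST_WINDOW_MS)
      ? baseWeight * BOOST_FACTOR
      : baseWeight;
}
{code}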
