[
https://issues.apache.org/jira/browse/HADOOP-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613022#action_12613022
]
Matei Zaharia commented on HADOOP-3746:
---------------------------------------
*Vivek* - You're right, JIRAs like this one and HADOOP-3445 are adding a lot of
capabilities to Hadoop's scheduling. Some of them may overlap, but I think it's
useful to have several implementations to choose from - Linux takes the same
approach, for example. The philosophy behind this patch was to create a
scheduler that "just works" in the common use case of multiple users running
jobs of various lengths, without requiring any configuration or administration
work such as setting up pools, while still providing enough capability to set
up group allocations when necessary. This is the situation we want to
support at Facebook (where I'm doing this work) - there is a Hadoop cluster
with regular users, but it is also becoming popular with external users
who want to run a variety of jobs. Most jobs run as root, and it's tough to
require everybody to specify a project name or task cap, so we want the default
behavior to be sensible and easy to understand.
Some of the code here might be useful for HADOOP-3445 if you provide the
capability to have a per-queue scheduler - then we could have fair scheduling
for some of the queues (hopefully also including the default queue).
At the same time, I want to add a few extensibility points to this scheduler.
First, I want to add a way to extend the weight and deficit calculations,
perhaps by providing multiple Adjuster classes that can be chained together.
This could be used, for example, to boost the priority of new jobs during
their first few minutes (reducing response times for interactive queries), or
to take locality into account when deciding which job to assign to each slot.
Second, I've already implemented an interface called LoadManager that decides
how many tasks should run on each TaskTracker. This currently uses the caps,
but an alternate implementation we might try is to assign caps based on load
(starting more tasks on nodes where the running tasks are not utilizing all of
the CPU, bandwidth and memory). The scheduler is also fairly modular, and it's
easy to change or reuse particular components, like the JobSelectors.
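To make the extension points above concrete, here is a minimal sketch. The names Adjuster and LoadManager come from the comment itself, but the method signatures, the JobInfo stub, and the boost values are my assumptions, not the actual patch API:

```java
// Hypothetical sketch of the Adjuster extension point: chainable classes
// that transform a job's weight. JobInfo is a minimal stand-in for the
// scheduler's real job bookkeeping class.
class JobInfo {
    private final long startTime; // job submission time, ms since epoch
    JobInfo(long startTime) { this.startTime = startTime; }
    long getStartTime() { return startTime; }
}

interface Adjuster {
    // Transform a job's weight; adjusters can be chained, each one
    // refining the output of the previous one.
    double adjustWeight(JobInfo job, double currentWeight);
}

// Example adjuster: triple the weight of jobs during their first 10
// minutes, reducing response times for short interactive queries.
// The window and factor are illustrative values.
class NewJobBoostAdjuster implements Adjuster {
    static final long BOOST_WINDOW_MS = 10 * 60 * 1000;
    static final double BOOST_FACTOR = 3.0;

    @Override
    public double adjustWeight(JobInfo job, double currentWeight) {
        long age = System.currentTimeMillis() - job.getStartTime();
        return age < BOOST_WINDOW_MS ? currentWeight * BOOST_FACTOR
                                     : currentWeight;
    }
}
```

A LoadManager would be a similar single-method interface, answering whether a given TaskTracker can accept another task.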
> A fair sharing job scheduler
> ----------------------------
>
> Key: HADOOP-3746
> URL: https://issues.apache.org/jira/browse/HADOOP-3746
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
> Priority: Minor
> Fix For: 0.19.0
>
> Attachments: fairscheduler-v1.patch
>
>
> The default job scheduler in Hadoop has a first-in-first-out queue of jobs
> for each priority level. The scheduler always assigns task slots to the first
> job in the highest-level priority queue that is in need of tasks. This makes
> it difficult to share a MapReduce cluster between users because a large job
> will starve subsequent jobs in its queue, but at the same time, giving lower
> priorities to large jobs would cause them to be starved by a stream of
> higher-priority jobs. Today one solution to this problem is to create
> separate MapReduce clusters for different user groups with Hadoop On-Demand,
> but this hurts system utilization because a group's cluster may be mostly
> idle for long periods of time. HADOOP-3445 also addresses this problem by
> sharing a cluster between different queues, but still provides only FIFO
> scheduling within a queue.
> This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits
> up compute time proportionally between jobs that have been submitted,
> emulating an "ideal" scheduler that gives each job 1/Nth of the available
> capacity. When there is a single job running, that job receives all the
> capacity. When other jobs are submitted, task slots that free up are
> assigned to the new jobs, so that everyone gets roughly the same amount of
> compute time. This lets short jobs finish in reasonable amounts of time while
> not starving long jobs. This is the type of scheduling used or emulated by
> operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing
> can also work with job priorities - the priorities are used as weights to
> determine the fraction of total compute time that a job should get.
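The priorities-as-weights idea above reduces to a simple proportional split: each job's fair share of the cluster's slots is its weight divided by the total weight. A small illustrative sketch (class name and weight values are mine, not from the patch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative weighted fair sharing: each job's share of the cluster's
// task slots is totalSlots * weight / totalWeight.
class FairShares {
    static Map<String, Double> shares(Map<String, Double> weights,
                                      int totalSlots) {
        double totalWeight = weights.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // A job with twice the weight gets twice the slots.
            out.put(e.getKey(), totalSlots * e.getValue() / totalWeight);
        }
        return out;
    }
}
```

For example, with weights {A=2.0, B=1.0, C=1.0} on a 100-slot cluster, A's share is 50 slots and B and C get 25 each.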
> In addition, the scheduler will support a way to guarantee capacity for
> particular jobs or user groups. A job can be marked as belonging to a "pool"
> using a parameter in the jobconf. An "allocations" file on the JobTracker can
> assign a minimum allocation to each pool, which is a minimum number of map
> slots and reduce slots that the pool must be guaranteed to get when it
> contains jobs. The scheduler will ensure that each pool gets at least its
> minimum allocation when it contains jobs, but it will use fair sharing to
> assign any excess capacity, as well as the capacity within each pool. This
> lets an organization divide a cluster between groups similarly to the job
> queues in HADOOP-3445.
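The pool guarantee described above can be sketched as: give every pool its minimum allocation first, then distribute the excess. This sketch splits the excess evenly as a stand-in for the full fair-sharing logic; the pool names and numbers are invented:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative pool allocation: each pool is first guaranteed its
// minimum slot allocation (as read from the allocations file), and any
// remaining slots are split among the pools. Equal splitting stands in
// here for the scheduler's fair-sharing of excess capacity.
class PoolShares {
    static Map<String, Integer> allocate(Map<String, Integer> minShares,
                                         int totalSlots) {
        Map<String, Integer> out = new LinkedHashMap<>(minShares);
        int guaranteed = minShares.values().stream()
                .mapToInt(Integer::intValue).sum();
        int perPool = (totalSlots - guaranteed) / minShares.size();
        // Every pool keeps its guaranteed minimum plus a share of excess.
        out.replaceAll((pool, min) -> min + perPool);
        return out;
    }
}
```

With minimums {eng=20, ads=10} on a 50-slot cluster, the 20 excess slots are split so eng ends up with 30 slots and ads with 20.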
> *Implementation Status:*
> I've implemented this scheduler using a version of the pluggable scheduler
> API in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair
> sharing, pools, priorities for weighting job shares, and a text-based
> allocation config file that is reloaded at runtime whenever it changes,
> so allocations can be updated without restarting the cluster. I will
> also create a patch for trunk that works with the latest interface in the
> patch submitted for HADOOP-3412.
> The actual implementation is simple. To implement fair sharing, the scheduler
> keeps track of a "deficit" for each job - the difference between the amount
> of compute time it should have gotten on an ideal scheduler, and the amount
> of compute time it actually got. This is a measure of how "unfair" we've been
> to the job. Every few hundred milliseconds, the scheduler updates the deficit
> of each job by looking at how many tasks each job had running during this
> interval vs. how many it should have had given its weight and the set of jobs
> that were running in this period. Whenever a task slot becomes available, it
> is assigned to the job with the highest deficit - unless one or more
> jobs are not meeting their pool capacity guarantees, in which case
> we choose among those "needy" jobs, again based on their deficit.
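The deficit bookkeeping described above can be sketched in a few lines. Class and field names here are hypothetical; only the algorithm follows the text: each update interval, a job's deficit grows by (fair share - actual running tasks) * interval, and a freed slot goes to the job with the highest deficit:

```java
import java.util.Comparator;
import java.util.List;

// Minimal sketch of deficit-based fair sharing as described above.
class Job {
    final String name;
    double weight;
    int runningTasks;
    double deficit; // accumulated "unfairness", in task-milliseconds
    Job(String name, double weight) { this.name = name; this.weight = weight; }
}

class DeficitScheduler {
    // Called every few hundred ms: grow each job's deficit by the gap
    // between its fair share and what it actually ran this interval.
    static void updateDeficits(List<Job> jobs, int totalSlots,
                               long intervalMs) {
        double totalWeight = jobs.stream().mapToDouble(j -> j.weight).sum();
        for (Job j : jobs) {
            double fairShare = totalSlots * j.weight / totalWeight;
            j.deficit += (fairShare - j.runningTasks) * intervalMs;
        }
    }

    // A freed slot goes to the job that has been treated most unfairly.
    static Job assignSlot(List<Job> jobs) {
        return jobs.stream()
                .max(Comparator.comparingDouble(j -> j.deficit))
                .orElse(null);
    }
}
```

For example, on a 4-slot cluster with two equal-weight jobs where one holds all 4 slots, the idle job's deficit climbs each interval, so it wins the next free slot.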
> *Extensions:*
> Once we keep track of pools, weights and deficits, we can do a lot of
> interesting things with a fair scheduler. One feature I will probably add is
> an option to give brand new jobs a priority boost until they have run for,
> say, 10 minutes, to reduce response times even further for short jobs such as
> ad-hoc queries, while still being fair to longer-running jobs. It would also
> be easy to add a "maximum number of tasks" cap for each job as in HADOOP-2573
> (although with priorities and pools, this JIRA reduces the need for such a
> cap - you can put a job in its own pool to give it a minimum share, and set
> its priority to VERY_LOW so it never takes excess capacity if there are other
> jobs in the cluster). Finally, I may implement "hierarchical pools" - the
> ability for a group to create pools within its pool, so that it can guarantee
> minimum allocations to various types of jobs but ensure that together, its
> jobs get capacity equal to at least its full pool.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.