[
https://issues.apache.org/jira/browse/HADOOP-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711797#action_12711797
]
Todd Lipcon commented on HADOOP-3746:
-------------------------------------
The 0.18.3 patch attached to this ticket introduces a race condition
(documented in HADOOP-5852) where the interTrackerServer is started before the
taskTrackerManager member is set. I will attach a delta against that patch here.
> A fair sharing job scheduler
> ----------------------------
>
> Key: HADOOP-3746
> URL: https://issues.apache.org/jira/browse/HADOOP-3746
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Matei Zaharia
> Assignee: Matei Zaharia
> Priority: Minor
> Fix For: 0.19.0
>
> Attachments: fairscheduler-0.17.2.patch, fairscheduler-0.18.1.patch,
> fairscheduler-0.18.3.patch, fairscheduler-v1.patch, fairscheduler-v2.patch,
> fairscheduler-v3.1.patch, fairscheduler-v3.patch, fairscheduler-v4.patch,
> fairscheduler-v5.1.patch, fairscheduler-v5.2.patch, fairscheduler-v5.patch,
> fairscheduler-v6.patch
>
>
> The default job scheduler in Hadoop has a first-in-first-out queue of jobs
> for each priority level. The scheduler always assigns task slots to the first
> job in the highest-level priority queue that is in need of tasks. This makes
> it difficult to share a MapReduce cluster between users because a large job
> will starve subsequent jobs in its queue, but at the same time, giving lower
> priorities to large jobs would cause them to be starved by a stream of
> higher-priority jobs. Today one solution to this problem is to create
> separate MapReduce clusters for different user groups with Hadoop On-Demand,
> but this hurts system utilization because a group's cluster may be mostly
> idle for long periods of time. HADOOP-3445 also addresses this problem by
> sharing a cluster between different queues, but still provides only FIFO
> scheduling within a queue.
> This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits
> up compute time proportionally between jobs that have been submitted,
> emulating an "ideal" scheduler that gives each job 1/Nth of the available
> capacity. When there is a single job running, that job receives all the
> capacity. When other jobs are submitted, task slots that free up are
> assigned to the new jobs, so that everyone gets roughly the same amount of
> compute time. This lets short jobs finish in reasonable amounts of time while
> not starving long jobs. This is the type of scheduling used or emulated by
> operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing
> can also work with job priorities - the priorities are used as weights to
> determine the fraction of total compute time that a job should get.
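The weight-based sharing described above amounts to giving each job a fraction of the cluster equal to its weight over the sum of all running jobs' weights. A minimal sketch (class and method names are mine, not from the patch):

```java
public class FairShare {
    // Fraction of total slots job i should receive under weighted fair
    // sharing: weight_i / sum of all weights. With equal weights this
    // reduces to the 1/Nth "ideal" scheduler described above.
    static double share(double[] weights, int i) {
        double total = 0;
        for (double w : weights) total += w;
        return weights[i] / total;
    }

    public static void main(String[] args) {
        // Two normal-priority jobs plus one job whose priority gives it
        // double weight: the heavy job gets half the cluster.
        double[] weights = {1.0, 1.0, 2.0};
        System.out.println(share(weights, 2)); // 2.0 / 4.0 = 0.5
    }
}
```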
> In addition, the scheduler will support a way to guarantee capacity for
> particular jobs or user groups. A job can be marked as belonging to a "pool"
> using a parameter in the jobconf. An "allocations" file on the JobTracker can
> assign a minimum allocation to each pool, which is a minimum number of map
> slots and reduce slots that the pool must be guaranteed to get when it
> contains jobs. The scheduler will ensure that each pool gets at least its
> minimum allocation when it contains jobs, but it will use fair sharing to
> assign any excess capacity, as well as the capacity within each pool. This
> lets an organization divide a cluster between groups similarly to the job
> queues in HADOOP-3445.
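One plausible shape for such an allocations file, giving two pools guaranteed map and reduce slots (the format here is purely illustrative; the patch defines the actual syntax):

```
# Minimum guaranteed slots per pool; excess capacity is fair-shared.
# Reloaded at runtime whenever the file changes.
pool research
  minMaps 20
  minReduces 10
pool production
  minMaps 40
  minReduces 20
```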
> *Implementation Status:*
> I've implemented this scheduler using a version of the pluggable scheduler
> API in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair
> sharing, pools, priorities for weighing job shares, and a text-based
> allocation config file that is reloaded at runtime whenever it changes,
> making it possible to change allocations without restarting the cluster. I
> will
> also create a patch for trunk that works with the latest interface in the
> patch submitted for HADOOP-3412.
> The actual implementation is simple. To implement fair sharing, the scheduler
> keeps track of a "deficit" for each job - the difference between the amount
> of compute time it should have gotten on an ideal scheduler, and the amount
> of compute time it actually got. This is a measure of how "unfair" we've been
> to the job. Every few hundred milliseconds, the scheduler updates the deficit
> of each job by looking at how many tasks each job had running during this
> interval vs. how many it should have had given its weight and the set of jobs
> that were running in this period. Whenever a task slot becomes available, it
> is assigned to the job with the highest deficit - unless there are one or
> more jobs that are not meeting their pool capacity guarantees, in which case
> we choose among those "needy" jobs, again based on their deficit.
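The deficit update and slot-assignment policy above can be sketched roughly as follows. This is a hedged illustration under my own naming, not code from the fairscheduler patches:

```java
import java.util.ArrayList;
import java.util.List;

public class DeficitSketch {
    static class Job {
        double weight;        // derived from job priority
        int runningTasks;     // tasks running during the last interval
        double deficit;       // compute time owed minus compute time received
        boolean belowPoolMin; // true if the job's pool is under its guarantee
        Job(double weight) { this.weight = weight; }
    }

    // Called every few hundred milliseconds: each job's deficit grows by the
    // slot-time an ideal weighted scheduler would have given it over the
    // interval, minus the slot-time it actually received.
    static void updateDeficits(List<Job> jobs, int totalSlots, long intervalMs) {
        double totalWeight = 0;
        for (Job j : jobs) totalWeight += j.weight;
        for (Job j : jobs) {
            double fairSlots = totalSlots * j.weight / totalWeight;
            j.deficit += (fairSlots - j.runningTasks) * intervalMs;
        }
    }

    // When a slot frees up: restrict to "needy" jobs whose pools are under
    // their guaranteed minimums, if any exist; among the candidates, pick
    // the job with the highest deficit.
    static Job assignSlot(List<Job> jobs) {
        List<Job> candidates = new ArrayList<>();
        for (Job j : jobs) if (j.belowPoolMin) candidates.add(j);
        if (candidates.isEmpty()) candidates = jobs;
        Job best = null;
        for (Job j : candidates)
            if (best == null || j.deficit > best.deficit) best = j;
        return best;
    }
}
```

Note how a newly submitted job with zero running tasks accumulates a positive deficit each interval, so it is naturally favored as slots free up - this is what lets short jobs finish quickly without preempting long ones.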
> *Extensions:*
> Once we keep track of pools, weights and deficits, we can do a lot of
> interesting things with a fair scheduler. One feature I will probably add is
> an option to give brand new jobs a priority boost until they have run for,
> say, 10 minutes, to reduce response times even further for short jobs such as
> ad-hoc queries, while still being fair to longer-running jobs. It would also
> be easy to add a "maximum number of tasks" cap for each job as in HADOOP-2573
> (although with priorities and pools, this JIRA reduces the need for such a
> cap - you can put a job in its own pool to give it a minimum share, and set
> its priority to VERY_LOW so it never takes excess capacity if there are other
> jobs in the cluster). Finally, I may implement "hierarchical pools" - the
> ability for a group to create pools within its pool, so that it can guarantee
> minimum allocations to various types of jobs but ensure that together, its
> jobs get capacity equal to at least its full pool.