[ https://issues.apache.org/jira/browse/HADOOP-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711797#action_12711797 ]
Todd Lipcon commented on HADOOP-3746:
-------------------------------------

The 0.18.3 patch attached to this ticket introduces a race condition (documented in HADOOP-5852) where the interTrackerServer is started before the taskTrackerManager member is set. Will attach a delta against that patch here.

> A fair sharing job scheduler
> ----------------------------
>
>                 Key: HADOOP-3746
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3746
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Minor
>             Fix For: 0.19.0
>
>         Attachments: fairscheduler-0.17.2.patch, fairscheduler-0.18.1.patch, fairscheduler-0.18.3.patch, fairscheduler-v1.patch, fairscheduler-v2.patch, fairscheduler-v3.1.patch, fairscheduler-v3.patch, fairscheduler-v4.patch, fairscheduler-v5.1.patch, fairscheduler-v5.2.patch, fairscheduler-v5.patch, fairscheduler-v6.patch
>
> The default job scheduler in Hadoop has a first-in-first-out queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-level priority queue that is in need of tasks. This makes it difficult to share a MapReduce cluster between users, because a large job will starve subsequent jobs in its queue, but at the same time, giving lower priorities to large jobs would cause them to be starved by a stream of higher-priority jobs. Today one solution to this problem is to create separate MapReduce clusters for different user groups with Hadoop On-Demand, but this hurts system utilization because a group's cluster may be mostly idle for long periods of time. HADOOP-3445 also addresses this problem by sharing a cluster between different queues, but still provides only FIFO scheduling within a queue.
>
> This JIRA proposes a job scheduler based on fair sharing.
> Fair sharing splits up compute time proportionally between the jobs that have been submitted, emulating an "ideal" scheduler that gives each job 1/Nth of the available capacity. When there is a single job running, that job receives all the capacity. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that everyone gets roughly the same amount of compute time. This lets short jobs finish in a reasonable amount of time while not starving long jobs. This is the type of scheduling used or emulated by operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
>
> In addition, the scheduler will support a way to guarantee capacity for particular jobs or user groups. A job can be marked as belonging to a "pool" using a parameter in the jobconf. An "allocations" file on the JobTracker can assign a minimum allocation to each pool - a minimum number of map slots and reduce slots that the pool is guaranteed to get whenever it contains jobs. The scheduler will ensure that each pool gets at least its minimum allocation when it contains jobs, but it will use fair sharing to assign any excess capacity, as well as the capacity within each pool. This lets an organization divide a cluster between groups similarly to the job queues in HADOOP-3445.
>
> *Implementation Status:*
>
> I've implemented this scheduler using a version of the pluggable scheduler API in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair sharing, pools, priorities for weighting job shares, and a text-based allocation config file that is reloaded at runtime whenever it changes, making it possible to change allocations without restarting the cluster.
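For illustration, an allocations file along the lines described above might look as follows. This is a hypothetical sketch only - the element names (allocations, pool, minMaps, minReduces) are assumptions for illustration, not necessarily the syntax used by the attached patches:

```xml
<?xml version="1.0"?>
<!-- Hypothetical allocations file: each pool is guaranteed a minimum
     number of map and reduce slots whenever it contains jobs; any
     excess capacity is divided by fair sharing. -->
<allocations>
  <pool name="research">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
  </pool>
  <pool name="production">
    <minMaps>40</minMaps>
    <minReduces>20</minReduces>
  </pool>
</allocations>
```

A job would opt into a pool via a jobconf parameter, and the JobTracker would reload this file at runtime when it changes, as described above.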
> I will also create a patch for trunk that works with the latest interface in the patch submitted for HADOOP-3412.
>
> The actual implementation is simple. To implement fair sharing, the scheduler keeps track of a "deficit" for each job - the difference between the amount of compute time it should have gotten on an ideal scheduler and the amount of compute time it actually got. This is a measure of how "unfair" we've been to the job. Every few hundred milliseconds, the scheduler updates the deficit of each job by comparing how many tasks the job had running during this interval with how many it should have had, given its weight and the set of jobs that were running in this period. Whenever a task slot becomes available, it is assigned to the job with the highest deficit - unless one or more jobs were not meeting their pool capacity guarantees, in which case we choose among those "needy" jobs, again based on their deficit.
>
> *Extensions:*
>
> Once we keep track of pools, weights and deficits, we can do a lot of interesting things with a fair scheduler. One feature I will probably add is an option to give brand new jobs a priority boost until they have run for, say, 10 minutes, to reduce response times even further for short jobs such as ad-hoc queries, while still being fair to longer-running jobs. It would also be easy to add a "maximum number of tasks" cap for each job as in HADOOP-2573 (although with priorities and pools, this JIRA reduces the need for such a cap - you can put a job in its own pool to give it a minimum share, and set its priority to VERY_LOW so it never takes excess capacity if there are other jobs in the cluster). Finally, I may implement "hierarchical pools" - the ability for a group to create pools within its pool, so that it can guarantee minimum allocations to various types of jobs while ensuring that, together, its jobs get capacity equal to at least its full pool.
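The deficit mechanism described above can be sketched in a few lines. This is a minimal, self-contained illustration under the description's assumptions, not the patch's actual code - the class and function names here are invented:

```python
# Sketch of deficit-based fair scheduling (hypothetical names, not the
# patch's code). Each job's deficit grows by the gap between its weighted
# fair share of slots and the slots it actually held during an interval.

class Job:
    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight      # priority expressed as a weight
        self.running = 0          # task slots currently held
        self.deficit = 0.0        # compute time "owed" to this job

def update_deficits(jobs, total_slots, interval_ms):
    """Every interval, credit each job with (fair share - actual share)."""
    total_weight = sum(j.weight for j in jobs)
    for j in jobs:
        fair_share = total_slots * j.weight / total_weight
        j.deficit += (fair_share - j.running) * interval_ms

def assign_slot(jobs):
    """When a task slot frees up, give it to the job with the highest deficit."""
    neediest = max(jobs, key=lambda j: j.deficit)
    neediest.running += 1
    return neediest

jobs = [Job("big-batch"), Job("adhoc-query")]
jobs[0].running = 10              # the big job currently holds all 10 slots
update_deficits(jobs, total_slots=10, interval_ms=500)
winner = assign_slot(jobs)        # the starved new job has the higher deficit
print(winner.name)                # -> adhoc-query
```

The pool guarantees would add one step before the deficit comparison: if any pool is below its minimum allocation, the slot is offered only to jobs in those "needy" pools, still ordered by deficit.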
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.