[ https://issues.apache.org/jira/browse/HADOOP-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711797#action_12711797 ]

Todd Lipcon commented on HADOOP-3746:
-------------------------------------

The 0.18.3 patch attached to this ticket introduces a race condition 
(documented in HADOOP-5852) where the interTrackerServer is started before the 
taskTrackerManager member is set. I will attach a delta against that patch here.
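
For anyone hitting this before the delta lands, the fix is essentially a 
reordering during JobTracker startup, along these lines (simplified sketch 
only - the setter name and surrounding code here are illustrative, not the 
actual code in the delta):

    // Sketch: give the scheduler its TaskTrackerManager (the JobTracker)
    // and start it before interTrackerServer begins answering heartbeats
    // that call back into the scheduler.
    taskScheduler.setTaskTrackerManager(this);  // illustrative setter name
    taskScheduler.start();
    this.interTrackerServer.start();            // only now accept heartbeats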

> A fair sharing job scheduler
> ----------------------------
>
>                 Key: HADOOP-3746
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3746
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Minor
>             Fix For: 0.19.0
>
>         Attachments: fairscheduler-0.17.2.patch, fairscheduler-0.18.1.patch, 
> fairscheduler-0.18.3.patch, fairscheduler-v1.patch, fairscheduler-v2.patch, 
> fairscheduler-v3.1.patch, fairscheduler-v3.patch, fairscheduler-v4.patch, 
> fairscheduler-v5.1.patch, fairscheduler-v5.2.patch, fairscheduler-v5.patch, 
> fairscheduler-v6.patch
>
>
> The default job scheduler in Hadoop has a first-in-first-out queue of jobs 
> for each priority level. The scheduler always assigns task slots to the first 
> job in the highest-level priority queue that is in need of tasks. This makes 
> it difficult to share a MapReduce cluster between users because a large job 
> will starve subsequent jobs in its queue, but at the same time, giving lower 
> priorities to large jobs would cause them to be starved by a stream of 
> higher-priority jobs. Today one solution to this problem is to create 
> separate MapReduce clusters for different user groups with Hadoop On-Demand, 
> but this hurts system utilization because a group's cluster may be mostly 
> idle for long periods of time. HADOOP-3445 also addresses this problem by 
> sharing a cluster between different queues, but still provides only FIFO 
> scheduling within a queue.
>
> This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits 
> up compute time proportionally between jobs that have been submitted, 
> emulating an "ideal" scheduler that gives each job 1/Nth of the available 
> capacity. When there is a single job running, that job receives all the 
> capacity. When other jobs are submitted, task slots that free up are 
> assigned to the new jobs, so that everyone gets roughly the same amount of 
> compute time. This lets short jobs finish in reasonable amounts of time while 
> not starving long jobs. This is the type of scheduling used or emulated by 
> operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing 
> can also work with job priorities - the priorities are used as weights to 
> determine the fraction of total compute time that a job should get. 
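>
> (For example - illustrative arithmetic only, not taken from the patch - with 
> 100 map slots and three running jobs whose priorities give them weights 2, 1 
> and 1, fair sharing would aim to give them roughly 50, 25 and 25 slots 
> respectively.)
>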
> In addition, the scheduler will support a way to guarantee capacity for 
> particular jobs or user groups. A job can be marked as belonging to a "pool" 
> using a parameter in the jobconf. An "allocations" file on the JobTracker can 
> assign a minimum allocation to each pool, which is a minimum number of map 
> slots and reduce slots that the pool must be guaranteed to get when it 
> contains jobs. The scheduler will ensure that each pool gets at least its 
> minimum allocation when it contains jobs, but it will use fair sharing to 
> assign any excess capacity, as well as the capacity within each pool. This 
> lets an organization divide a cluster between groups similarly to the job 
> queues in HADOOP-3445.
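>
> (As an illustrative sketch only - the parameter name and file layout below 
> are hypothetical, not the ones in the patch - marking a job's pool and 
> giving pools minimum slots might look roughly like this:)
>
>     JobConf jobConf = new JobConf();   // org.apache.hadoop.mapred.JobConf
>     // Hypothetical jobconf parameter naming the job's pool:
>     jobConf.set("pool.name", "research");
>
>     // Hypothetical text-based allocations file on the JobTracker, giving
>     // each pool a guaranteed minimum number of map and reduce slots:
>     //   pool research      minMaps=20  minReduces=10
>     //   pool production    minMaps=50  minReduces=30
>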
> *Implementation Status:*
> I've implemented this scheduler using a version of the pluggable scheduler 
> API in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair 
> sharing, pools, priorities for weighing job shares, and a text-based 
> allocation config file that is reloaded at runtime whenever it has changed to 
> make it possible to change allocations without restarting the cluster. I will 
> also create a patch for trunk that works with the latest interface in the 
> patch submitted for HADOOP-3412.
>
> The actual implementation is simple. To implement fair sharing, the scheduler 
> keeps track of a "deficit" for each job - the difference between the amount 
> of compute time it should have gotten on an ideal scheduler, and the amount 
> of compute time it actually got. This is a measure of how "unfair" we've been 
> to the job. Every few hundred milliseconds, the scheduler updates the deficit 
> of each job by looking at how many tasks each job had running during this 
> interval vs. how many it should have had given its weight and the set of jobs 
> that were running in this period. Whenever a task slot becomes available, it 
> is assigned to the job with the highest deficit - unless there are one or 
> more jobs that are not meeting their pool capacity guarantees, in which case 
> we choose among those "needy" jobs based again on their deficit.
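>
> (A rough, self-contained sketch of that bookkeeping - all of the names below 
> are hypothetical and simplified, not taken from the patch:)
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Hypothetical, simplified deficit bookkeeping, for illustration only.
>     class DeficitSketch {
>       static class JobInfo {
>         double weight = 1.0;      // e.g. derived from the job's priority
>         double deficit = 0.0;     // "unfairness" accumulated so far
>         int runningTasks = 0;     // tasks running over the last interval
>         boolean belowPoolMinimum = false; // pool guarantee not yet met
>       }
>
>       List<JobInfo> runningJobs = new ArrayList<JobInfo>();
>       int totalSlots;             // map or reduce slots in the cluster
>
>       // Every few hundred milliseconds: grow each job's deficit by the gap
>       // between the slot-time it should have had and what it actually got.
>       void updateDeficits(long intervalMillis) {
>         double totalWeight = 0;
>         for (JobInfo job : runningJobs) {
>           totalWeight += job.weight;
>         }
>         for (JobInfo job : runningJobs) {
>           double idealSlots = totalSlots * job.weight / totalWeight;
>           job.deficit += (idealSlots - job.runningTasks) * intervalMillis;
>         }
>       }
>
>       // When a slot frees up: restrict to jobs below their pool's minimum
>       // if any exist, then hand the slot to the largest deficit.
>       JobInfo chooseJob() {
>         List<JobInfo> candidates = new ArrayList<JobInfo>();
>         for (JobInfo job : runningJobs) {
>           if (job.belowPoolMinimum) {
>             candidates.add(job);
>           }
>         }
>         if (candidates.isEmpty()) {
>           candidates = runningJobs;
>         }
>         JobInfo best = null;
>         for (JobInfo job : candidates) {
>           if (best == null || job.deficit > best.deficit) {
>             best = job;
>           }
>         }
>         return best;
>       }
>     }
>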
> *Extensions:*
> Once we keep track of pools, weights and deficits, we can do a lot of 
> interesting things with a fair scheduler. One feature I will probably add is 
> an option to give brand new jobs a priority boost until they have run for, 
> say, 10 minutes, to reduce response times even further for short jobs such as 
> ad-hoc queries, while still being fair to longer-running jobs. It would also 
> be easy to add a "maximum number of tasks" cap for each job as in HADOOP-2573 
> (although with priorities and pools, this JIRA reduces the need for such a 
> cap - you can put a job in its own pool to give it a minimum share, and set 
> its priority to VERY_LOW so it never takes excess capacity if there are other 
> jobs in the cluster). Finally, I may implement "hierarchical pools" - the 
> ability for a group to create pools within its pool, so that it can guarantee 
> minimum allocations to various types of jobs but ensure that together, its 
> jobs get capacity equal to at least its full pool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
