[ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718326#action_12718326 ]

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Hi Vinod,

Sorry for not addressing the half fair share point earlier; it looks like I 
forgot to post that. The difference between isStarvedForFairShare and 
tasksDueToFairShare is intentional. I want the threshold for triggering fair 
share preemption to be much lower than the fair share itself, so that it 
doesn't trigger unless something is going horribly wrong in the cluster. The 
reason is that in standard use of the fair scheduler, we expect any critical 
("production") jobs to have min shares set, which are enforced much more 
precisely (and potentially with a smaller timeout). The fair share is for jobs 
that are not critical, which we're okay with treating a little unfairly if 
that reduces wasted work. So the service model is that we only preempt if a 
job is being starved very badly. However, when we do preempt, we bring you up 
to your full fair share, because at that point it's clear you've been starved 
badly for a long time. Once you are at your full fair share, it will be easy 
to stay there, since you'll be given chances to reuse those slots as your 
tasks finish. If some users request stricter enforcement of fair shares, we 
can make the "half" part configurable later, but we decided this model is a 
good way to prevent unnecessary preemption and swapping of slots back and 
forth between jobs, while also not being too unfair.
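
To make the asymmetry concrete, here is a minimal sketch of the two checks, 
assuming a per-job running-task count and a computed fair share. The method 
names mirror the ones mentioned above, but the bodies are illustrative only, 
not the patch's exact logic:

{code:java}
// Illustrative sketch of the trigger/target asymmetry; not the patch code.
public class FairSharePreemptionSketch {

  // Trigger: a job counts as starved only if it has been below HALF its
  // fair share for longer than the fair share preemption timeout.
  public static boolean isStarvedForFairShare(
      int runningTasks, double fairShare, long belowHalfSinceMs,
      long timeoutMs, long nowMs) {
    return runningTasks < fairShare / 2
        && (nowMs - belowHalfSinceMs) >= timeoutMs;
  }

  // Target: once preemption fires, reclaim enough slots to bring the job
  // all the way up to its FULL fair share, not just to the half-share
  // trigger level.
  public static int tasksDueToFairShare(int runningTasks, double fairShare) {
    return Math.max(0, (int) fairShare - runningTasks);
  }
}
{code}

The point is that the trigger (half share, long timeout) is deliberately much 
harder to hit than the restoration target (full share), which is what keeps 
slots from ping-ponging between jobs.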

I'll make a patch with the other changes sometime in the next few days or maybe 
after I see some of your comments.

The changes in test cases and docs are indeed huge. The request was huge ;) 
(and important), and I took this opportunity to clean up the fair scheduler 
docs overall.

> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 0.21.0
>
>         Attachments: fs-preemption-v0.patch, hadoop-4665-v1.patch, 
> hadoop-4665-v1b.patch, hadoop-4665-v2.patch, hadoop-4665-v3.patch, 
> hadoop-4665-v4.patch, hadoop-4665-v5.patch, hadoop-4665-v6.patch, 
> hadoop-4665-v7.patch, hadoop-4665-v7b.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: 
> users might submit long-running tasks by mistake (e.g. an infinite loop in a 
> map program), or tasks may simply be long because they process large amounts 
> of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed 
> capacity for certain queues, as well as a goal of providing good performance 
> for interactive jobs on average through fair sharing. Therefore, it will 
> support preemption under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 
> seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds 
> (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to 
> meet guarantees quickly. T2 is meant as a last resort in case non-critical 
> jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the 
> following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, 
> ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting 
> wasted time).
> - For reduces: similar to maps, but give extra preference to reduces in the 
> copy phase, where there is not much map output per task (at Facebook, we have 
> observed this to be the main time we need preemption: when a job has a long 
> map phase and its reducers are mostly sitting idle, filling up slots).
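
For concreteness, here is a hypothetical sketch of the kill-selection ordering 
in the quoted description. The class and field names are illustrative, not the 
actual scheduler code, and the map/reduce copy-phase preference is omitted:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KillSelectionSketch {

  static class TaskInfo {
    long runtimeMs;  // how long this attempt has been running
  }

  static class JobInfo {
    double fairShare;   // slots this job deserves
    double deficit;     // lower deficit = more overscheduled
    int runningTasks;
    List<TaskInfo> tasks = new ArrayList<TaskInfo>();
  }

  // Choose up to slotsNeeded victims: only from jobs running more than
  // their fair share, raiding the most overscheduled jobs first, and
  // within each job killing the shortest-running tasks to limit wasted work.
  static List<TaskInfo> chooseVictims(List<JobInfo> jobs, int slotsNeeded) {
    List<JobInfo> over = new ArrayList<JobInfo>();
    for (JobInfo j : jobs) {
      if (j.runningTasks > j.fairShare) {
        over.add(j);
      }
    }
    over.sort(Comparator.comparingDouble((JobInfo j) -> j.deficit));

    List<TaskInfo> victims = new ArrayList<TaskInfo>();
    for (JobInfo j : over) {
      // Never take a job below its own fair share while preempting.
      int surplus = j.runningTasks - (int) Math.ceil(j.fairShare);
      j.tasks.sort(Comparator.comparingLong((TaskInfo t) -> t.runtimeMs));
      for (TaskInfo t : j.tasks) {
        if (victims.size() >= slotsNeeded || surplus <= 0) {
          break;
        }
        victims.add(t);
        surplus--;
      }
      if (victims.size() >= slotsNeeded) {
        break;
      }
    }
    return victims;
  }
}
{code}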

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
