[ https://issues.apache.org/jira/browse/HADOOP-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664655#action_12664655 ]

Matei Zaharia commented on HADOOP-4665:
---------------------------------------

Those are good points, Joydeep. I will stop re-reading the conf variables every 
time; that was a mistake.

About subtracting the waits: this is just a question of how we interpret the 
parameters. Maybe we want nodeLocalWait and rackLocalWait to be two separate 
times that get added up into a "total wait". I originally meant for 
rackLocalWait to always be bigger than nodeLocalWait and thus capture the 
maximum delay. But since that is confusing and can lead to misconfiguration, I 
will make them add up as you said.
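
To be concrete, I mean something like the sketch below, where the total wait 
before a job may go off-rack is the sum of the two values. The class name, 
property keys, and defaults here are just illustrative, not the ones the patch 
actually uses:

import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the additive interpretation of the two locality waits.
 * Property names and defaults are made up for illustration only.
 */
class LocalityWaits {
  final long nodeLocalWait;  // ms a job waits before going rack-local
  final long totalWait;      // ms before it may go off-rack (node + rack)

  LocalityWaits(Configuration conf) {
    // Read the conf once up front instead of on every heartbeat.
    long nodeWait = conf.getLong("fairscheduler.locality.node.wait", 5000);
    long rackWait = conf.getLong("fairscheduler.locality.rack.wait", 5000);
    this.nodeLocalWait = nodeWait;
    this.totalWait = nodeWait + rackWait;  // the two waits add up
  }
}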

For your third point, the idea was as follows: when a job still has a lot of 
maps left to launch, it will almost always have node-local tasks available, so 
waiting is fine. However, when only a few maps are left, there are fewer nodes 
on which the job can launch node-local tasks, and those nodes may be busy 
running long tasks. So once the job has waited for nodeLocalWait, it becomes 
allowed to launch rack-local tasks instead. After it has launched such a task, 
it may keep launching rack-local tasks rather than having to start the wait 
all over again; otherwise, if the nodes with node-local data still aren't 
freeing up, its launch rate would slow drastically. However, we remember the 
locality level of the last map launched, so if the job ever *does* manage to 
launch a node-local task again, the wait period starts over. There's a similar 
story for going from rack-local to off-rack: once you've had to wait long 
enough to launch an off-rack task, you probably have very few opportunities 
left for launching rack-local or node-local tasks, so you might as well be 
allowed to launch more off-rack tasks and finish the job rather than having 
your launch rate slowed to a trickle.
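
In code, the per-job bookkeeping is roughly the following sketch; the names 
and the exact reset rule here are illustrative rather than lifted from the 
patch:

/**
 * Rough sketch of the per-job locality tracking described above.
 * Class, field, and method names are illustrative, not the patch's.
 */
class JobLocalityState {
  enum Level { NODE, RACK, OFF_RACK }   // ordered from most to least local

  private final long nodeLocalWait;     // ms before relaxing NODE -> RACK
  private final long rackLocalWait;     // additional ms before RACK -> OFF_RACK
  private Level lastLaunchedLevel = Level.NODE;
  private long waitStart = System.currentTimeMillis();

  JobLocalityState(long nodeLocalWait, long rackLocalWait) {
    this.nodeLocalWait = nodeLocalWait;
    this.rackLocalWait = rackLocalWait;
  }

  /** Least-local level this job is currently allowed to launch a map at. */
  Level allowedLevel(long now) {
    long waited = now - waitStart;
    switch (lastLaunchedLevel) {
      case OFF_RACK:
        return Level.OFF_RACK;  // already gone off-rack: no more waiting
      case RACK:
        // Already rack-local: keep launching rack-local freely, and allow
        // off-rack after waiting rackLocalWait more.
        return waited >= rackLocalWait ? Level.OFF_RACK : Level.RACK;
      default:  // NODE
        if (waited >= nodeLocalWait + rackLocalWait) return Level.OFF_RACK;
        if (waited >= nodeLocalWait) return Level.RACK;
        return Level.NODE;
    }
  }

  /** Record that a map was launched at the given locality level. */
  void recordLaunch(Level level, long now) {
    if (level != lastLaunchedLevel) {
      // Any change of level restarts the clock: in particular, a node-local
      // launch after rack-local ones makes the job wait again before locality
      // is relaxed, and a first rack-local launch starts the additional
      // rackLocalWait toward off-rack.
      waitStart = now;
    }
    lastLaunchedLevel = level;
  }
}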

Now, it's possible that just using shorter waits, but requiring the wait to 
happen every time you need to launch a non-local task, would work too. I don't 
know, but in my gridmix tests the current implementation worked fine even for 
very small jobs (going from 2% to 75-80% node locality for jobs with only 3 
map tasks on a 100-node cluster).


> Add preemption to the fair scheduler
> ------------------------------------
>
>                 Key: HADOOP-4665
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4665
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-preemption-v0.patch
>
>
> Task preemption is necessary in a multi-user Hadoop cluster for two reasons: 
> users might submit long-running tasks by mistake (e.g. an infinite loop in a 
> map program), or tasks may be long due to having to process large amounts of 
> data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity 
> for certain queues, as well as a goal of providing good performance for 
> interactive jobs on average through fair sharing. Therefore, it will support 
> preemption under two conditions:
> 1) A job isn't getting its _guaranteed_ share of the cluster for at least T1 
> seconds.
> 2) A job is getting significantly less than its _fair_ share for T2 seconds 
> (e.g. less than half its share).
> T1 will be chosen smaller than T2 (and will be configurable per queue) to 
> meet guarantees quickly. T2 is meant as a last resort in case non-critical 
> jobs in queues with no guaranteed capacity are being starved.
> When deciding which tasks to kill to make room for the job, we will use the 
> following heuristics:
> - Look for tasks to kill only in jobs that have more than their fair share, 
> ordering these by deficit (most overscheduled jobs first).
> - For maps: kill tasks that have run for the least amount of time (limiting 
> wasted time).
> - For reduces: similar to maps, but give extra preference to reduces in the 
> copy phase where there is not much map output per task (at Facebook, we have 
> observed this to be the main time we need preemption - when a job has a long 
> map phase and its reducers are mostly sitting idle and filling up slots).
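
For illustration, here is a rough sketch of the preemption conditions and 
victim-selection heuristic described in the issue above. All names are 
placeholders, not the API of fs-preemption-v0.patch:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch of the preemption policy described above.
 * Field and method names are placeholders for this sketch only.
 */
class PreemptionSketch {
  static class JobInfo {
    double fairShare;          // slots this job deserves under fair sharing
    int runningTasks;          // slots it currently holds
    List<TaskInfo> tasks = new ArrayList<TaskInfo>();
  }

  static class TaskInfo {
    long runTimeMs;            // how long this attempt has been running
    boolean reduceInCopyPhase; // reduce still fetching map output
  }

  /**
   * Condition sketch: preempt on behalf of a job that has been below its
   * guaranteed share for more than T1 ms, or below half its fair share for
   * more than T2 ms (with T1 < T2).
   */
  static boolean shouldPreemptFor(int running, double guaranteed, double fair,
                                  long msBelowGuaranteed, long msBelowHalfFair,
                                  long t1, long t2) {
    return (running < guaranteed && msBelowGuaranteed > t1)
        || (running < fair / 2 && msBelowHalfFair > t2);
  }

  /** Pick up to tasksNeeded tasks to kill from jobs over their fair share. */
  static List<TaskInfo> chooseVictims(List<JobInfo> jobs, int tasksNeeded) {
    List<TaskInfo> victims = new ArrayList<TaskInfo>();
    // Most over-scheduled jobs first (largest surplus over their fair share).
    jobs.sort(Comparator.comparingDouble(j -> j.fairShare - j.runningTasks));
    for (JobInfo job : jobs) {
      if (job.runningTasks <= job.fairShare) break;  // rest are not over-scheduled
      List<TaskInfo> candidates = new ArrayList<TaskInfo>(job.tasks);
      // Prefer reduces still in the copy phase (little work lost so far),
      // then the tasks that have run the shortest time, to limit wasted work.
      candidates.sort(Comparator
          .comparing((TaskInfo t) -> !t.reduceInCopyPhase)
          .thenComparingLong(t -> t.runTimeMs));
      for (TaskInfo t : candidates) {
        if (victims.size() >= tasksNeeded) return victims;
        victims.add(t);
      }
    }
    return victims;
  }
}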

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
