[ https://issues.apache.org/jira/browse/HADOOP-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663983#action_12663983 ]

Matei Zaharia commented on HADOOP-4667:
---------------------------------------

How to set the waits is an interesting question. I set them as times rather 
than as numbers of tasktrackers so that they are easy for an administrator to 
understand (if they want some kind of guarantee about response time) and so 
that you don't need to take the number of nodes in your cluster into account 
when deciding on a reasonable value. However, how long to set them for depends 
on several factors:
* How much you weight throughput versus response time. If you care only about 
throughput, it is generally better to wait longer.
* The percentage of nodes that have local data for you. If you are down to the 
last 1-2 map tasks to launch, the expected wait until you receive a heartbeat 
from a node with local data may be quite long, and you might as well launch 
non-locally right away.
* The nature of the tasks. If a task is CPU-heavy, there is little or no gain 
in response time from launching it non-locally.
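The skip-and-wait rule these thresholds feed into can be sketched roughly as follows. This is a simplified illustration, not code from the patch; the class, enum, and field names are all hypothetical, and the thresholds use the 15s/25s values suggested later in this comment:

```java
// Simplified sketch of delay scheduling: a job may only launch tasks at a
// given (less local) level once it has waited long enough while being
// skipped. All names here are illustrative, not taken from the actual patch.
public class DelaySchedulingSketch {
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    // Assumed wait thresholds, in milliseconds.
    static final long T1_MS = 15_000; // wait before allowing rack-local
    static final long T2_MS = 25_000; // wait before allowing off-rack

    /** Least-local level this job is currently allowed to launch at. */
    static Locality allowedLevel(long msWaitedWhileSkipped) {
        if (msWaitedWhileSkipped >= T2_MS) return Locality.OFF_RACK;
        if (msWaitedWhileSkipped >= T1_MS) return Locality.RACK_LOCAL;
        return Locality.NODE_LOCAL;
    }

    public static void main(String[] args) {
        System.out.println(allowedLevel(5_000));   // NODE_LOCAL
        System.out.println(allowedLevel(20_000));  // RACK_LOCAL
        System.out.println(allowedLevel(30_000));  // OFF_RACK
    }
}
```

On each heartbeat the scheduler would check a job's allowed level before assigning it a task at that level, which is what bounds the delay any job experiences.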

Another thing we've thought about is what to do if there's a "hotspot" node 
that everyone wants to run on. In this case, setting the waits too high is a 
bad idea, because you'll end up with a lot of tasks waiting on the hotspot node 
and with other nodes being underutilized. One interesting question though is 
which tasks to launch on the hotspot node. If you have an IO-bound job where 
each task takes 20s to process a block, while another job is more CPU-heavy and 
takes 60s to process a block, then you want to run the IO-bound job locally and 
the CPU-bound job non-locally. The reason is that in the time it takes to run 
one task from the CPU-heavy job, you could have run 3 tasks from the IO-bound 
one, saving both the cost of sending those 3 blocks across the network and 
response time.
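The arithmetic behind that hotspot example can be checked directly. The numbers (20s per IO-bound task, 60s per CPU-bound task) come from the text above; the helper itself is purely illustrative:

```java
// Back-of-the-envelope check for the hotspot example: how many IO-bound
// tasks fit on the hotspot node in the time one CPU-bound task would take?
public class HotspotMath {
    static int ioTasksPerCpuTask(int ioTaskSeconds, int cpuTaskSeconds) {
        return cpuTaskSeconds / ioTaskSeconds;
    }

    public static void main(String[] args) {
        // In the 60s one CPU-bound task needs, 3 IO-bound 20s tasks complete
        // locally, and 3 block transfers across the network are avoided.
        System.out.println(ioTasksPerCpuTask(20, 60)); // prints 3
    }
}
```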

I haven't included anything in this patch to deal with this case, or with 
setting the waits in general, because we found that waits of 10-15 seconds 
work well for dataLocalWait and 20-25 seconds work well for rackLocalWait. 
However, in a future patch, it might be worthwhile to look at task statistics 
to determine the IO rate of each job, identify the CPU-bound ones, and lower 
the waits on those jobs so that they go to non-hotspot nodes.
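That future-work idea could look something like the following. This is purely speculative: the class name, the IO-rate cutoff, and the wait-reduction factor are all invented for illustration and appear nowhere in the patch:

```java
// Speculative sketch: estimate a job's IO rate from task statistics and
// shorten its locality wait if the job looks CPU-bound, so its tasks drain
// to non-hotspot nodes sooner. All names and constants are assumptions.
public class AdaptiveWaitSketch {
    static final long DEFAULT_WAIT_MS = 15_000;
    // Assumed cutoff: tasks reading slower than ~4 MB/s are treated as
    // CPU-bound (a 20s task on a 128 MB block reads at ~6.4 MB/s; a 60s
    // task on the same block reads at ~2.2 MB/s).
    static final double IO_BOUND_BYTES_PER_SEC = 4 * 1024 * 1024;

    /** Returns a shorter locality wait for jobs that look CPU-bound. */
    static long localityWaitMs(long bytesReadPerTask, double avgTaskSeconds) {
        double ioRate = bytesReadPerTask / avgTaskSeconds;
        return ioRate >= IO_BOUND_BYTES_PER_SEC
                ? DEFAULT_WAIT_MS       // IO-bound: keep the full wait
                : DEFAULT_WAIT_MS / 3;  // CPU-bound: give up locality sooner
    }
}
```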

> Global scheduling in the Fair Scheduler
> ---------------------------------------
>
>                 Key: HADOOP-4667
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4667
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-global-v0.patch
>
>
> The current schedulers in Hadoop all examine a single job on every heartbeat 
> when choosing which tasks to assign, choosing the job based on FIFO or fair 
> sharing. There are inherent limitations to this approach. For example, if the 
> job at the front of the queue is small (e.g. 10 maps, in a cluster of 100 
> nodes), then on average it will launch only one local map on the first 10 
> heartbeats while it is at the head of the queue. This leads to very poor 
> locality for small jobs. Instead, we need a more "global" view of scheduling 
> that can look at multiple jobs. To resolve the locality problem, we will use 
> the following algorithm:
> - If the job at the head of the queue has no node-local task to launch, skip 
> it and look through other jobs.
> - If a job has waited at least T1 seconds while being skipped, also allow it 
> to launch rack-local tasks.
> - If a job has waited at least T2 > T1 seconds, also allow it to launch 
> off-rack tasks.
> This algorithm improves locality while bounding the delay that any job 
> experiences in launching a task.
> It turns out that whether waiting is useful depends on how many tasks are 
> left in the job (which determines the probability of getting a heartbeat 
> from a node with a local task) and on whether the job is CPU- or IO-bound. 
> Thus there may be logic for removing the wait on the last few tasks in the 
> job.
> As a related issue, once we allow global scheduling, we can launch multiple 
> tasks per heartbeat, as in HADOOP-3136. The initial implementation of 
> HADOOP-3136 adversely affected performance because it only launched multiple 
> tasks from the same job, but with the wait rule above, we will only do this 
> for jobs that are allowed to launch non-local tasks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
