[ 
https://issues.apache.org/jira/browse/HADOOP-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667061#action_12667061
 ] 

Matei Zaharia commented on HADOOP-4667:
---------------------------------------

Having some kind of SchedulingInfo data structure associated with each job 
might be a good start towards separating scheduling stuff from non-scheduling 
stuff and ultimately having a common scheduler codebase. It could be a field 
within the JobInProgress, and maybe schedulers would be allowed to extend the 
base class to have their own info attributes. Is this the kind of thing you're 
proposing?
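
A minimal sketch of what such a structure could look like (all names here are hypothetical, not existing Hadoop classes): a base SchedulingInfo held as a field of JobInProgress, which each scheduler subclasses for its own bookkeeping.

```java
// Hypothetical sketch only: a base SchedulingInfo that JobInProgress could
// hold in a single field, with scheduler-specific subclasses layered on top.
class SchedulingInfo {

    // Example extension: a fair-scheduler subclass might track how long a
    // job has been skipped while waiting for a local task. These fields and
    // methods are invented for illustration.
    static class FairSchedulerInfo extends SchedulingInfo {
        private long lastMapLaunchTime;  // when the job last launched a map
        private int skippedHeartbeats;   // heartbeats skipped for locality

        void recordSkip() { skippedHeartbeats++; }
        void recordLaunch(long now) {
            lastMapLaunchTime = now;
            skippedHeartbeats = 0;
        }
        int getSkippedHeartbeats() { return skippedHeartbeats; }
        long getLastMapLaunchTime() { return lastMapLaunchTime; }
    }
}
```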

I'm still somewhat wary about making obtainNewMapTask sometimes not return 
tasks due to this scheduling opportunity stuff though. It seems like the more 
things we add to it, the harder it will be to break the cycle and switch to a 
saner API (hasMapTask and createMapTask for example). Furthermore, once a 
technique like this is in the JobInProgress class, it's hard to try out other 
methods for achieving the same goal. One of the nicer things about having the 
scheduler API is that although it makes the codebase more fragmented, it's 
enabled us to experiment with stuff like this. As a concrete example, once this 
basic patch is in, I want to try a refinement for dealing with hotspot nodes 
that launches IO-intensive tasks preferentially on those nodes to maximize the 
rate of local IO. Would it make sense to be working in JobInProgress for that? 
So I'd prefer a way to provide this functionality to all the schedulers without 
ingraining it in JobInProgress, or at least without blocking the road towards 
changes to this policy. Perhaps figuring out how to split up obtainNewMapTask 
into something generic that everyone can use (give me a task now) and something 
smarter (count attempts and handle the wait for me) and perhaps even a version 
that just says whether there is a task with the given locality level without 
initializing it would be possible without significant code changes. Does that 
make sense or do you think it's better to leave the API exactly the same and do 
this by default?
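
One possible shape for that split (a sketch only: the hasMapTask/createMapTask names come from the discussion above, while SimpleMapSource and its in-memory behavior are invented purely for illustration):

```java
import java.util.*;

// Hypothetical split of obtainNewMapTask: a cheap "is there a task at this
// locality level?" query, separate from actually creating the task, so each
// scheduler can implement its own waiting policy on top.
interface MapTaskSource {
    enum LocalityLevel { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    // True if the job has an unlaunched map runnable at this level for the
    // given host; must not initialize or launch anything.
    boolean hasMapTask(String host, LocalityLevel level);

    // Creates and returns a task id, or null if none is available.
    String createMapTask(String host, LocalityLevel level);
}

// Toy in-memory implementation for illustration: pending map "splits" keyed
// by the host holding their input data.
class SimpleMapSource implements MapTaskSource {
    private final Map<String, Deque<String>> pendingByHost = new HashMap<>();
    private final Deque<String> allPending = new ArrayDeque<>();

    void addSplit(String host, String taskId) {
        pendingByHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(taskId);
        allPending.add(taskId);
    }

    public boolean hasMapTask(String host, LocalityLevel level) {
        if (level == LocalityLevel.NODE_LOCAL) {
            Deque<String> q = pendingByHost.get(host);
            return q != null && !q.isEmpty();
        }
        return !allPending.isEmpty();  // rack/off-rack treated alike here
    }

    public String createMapTask(String host, LocalityLevel level) {
        if (!hasMapTask(host, level)) return null;
        String id = (level == LocalityLevel.NODE_LOCAL)
            ? pendingByHost.get(host).poll()
            : allPending.poll();
        // Drop the chosen task from both indexes.
        allPending.remove(id);
        for (Deque<String> q : pendingByHost.values()) q.remove(id);
        return id;
    }
}
```

With this shape, a scheduler can call hasMapTask to decide whether to wait for locality before ever touching task creation.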

> Global scheduling in the Fair Scheduler
> ---------------------------------------
>
>                 Key: HADOOP-4667
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4667
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/fair-share
>            Reporter: Matei Zaharia
>         Attachments: fs-global-v0.patch, HADOOP-4667_api.patch
>
>
> The current schedulers in Hadoop all examine a single job on every heartbeat 
> when choosing which tasks to assign, choosing the job based on FIFO or fair 
> sharing. There are inherent limitations to this approach. For example, if the 
> job at the front of the queue is small (e.g. 10 maps, in a cluster of 100 
> nodes), then on average it will launch only one local map on the first 10 
> heartbeats while it is at the head of the queue. This leads to very poor 
> locality for small jobs. Instead, we need a more "global" view of scheduling 
> that can look at multiple jobs. To resolve the locality problem, we will use 
> the following algorithm:
> - If the job at the head of the queue has no node-local task to launch, skip 
> it and look through other jobs.
> - If a job has waited at least T1 seconds while being skipped, also allow it 
> to launch rack-local tasks.
> - If a job has waited at least T2 > T1 seconds, also allow it to launch 
> off-rack tasks.
> This algorithm improves locality while bounding the delay that any job 
> experiences in launching a task.
> It turns out that whether waiting is useful depends on how many tasks are 
> left in the job (which determines the probability of getting a heartbeat 
> from a node with a local task) and on whether the job is CPU- or IO-bound. 
> Thus there may be logic for removing the wait on the last few tasks in the job.
> As a related issue, once we allow global scheduling, we can launch multiple 
> tasks per heartbeat, as in HADOOP-3136. The initial implementation of 
> HADOOP-3136 adversely affected performance because it only launched multiple 
> tasks from the same job, but with the wait rule above, we will only do this 
> for jobs that are allowed to launch non-local tasks.
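
The T1/T2 wait rule quoted above can be sketched as follows. DelayPolicy and its member names are hypothetical, and T1/T2 are assumed to arrive as configuration values in milliseconds:

```java
// Hypothetical sketch of the wait rule from the issue description: given how
// long a job has been skipped, decide the worst locality level it may launch
// at. Requires t2Millis > t1Millis, matching T2 > T1 in the description.
class DelayPolicy {
    enum Level { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    private final long t1Millis;  // T1: threshold to allow rack-local tasks
    private final long t2Millis;  // T2: threshold to allow off-rack tasks

    DelayPolicy(long t1Millis, long t2Millis) {
        this.t1Millis = t1Millis;
        this.t2Millis = t2Millis;
    }

    // waitedMillis: how long the job has been skipped while at the head of
    // the queue without launching a node-local task.
    Level allowedLevel(long waitedMillis) {
        if (waitedMillis >= t2Millis) return Level.OFF_RACK;    // waited >= T2
        if (waitedMillis >= t1Millis) return Level.RACK_LOCAL;  // waited >= T1
        return Level.NODE_LOCAL;  // not yet skipped long enough: local only
    }
}
```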

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
