[
https://issues.apache.org/jira/browse/MAPREDUCE-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726774#action_12726774
]
Matei Zaharia commented on MAPREDUCE-548:
-----------------------------------------
I've been porting this JIRA to trunk now that HADOOP-4665 is in, and I tried
making it use missed scheduling opportunities rather than time waited in the
process. However, I discovered a problem with that approach. Suppose that the
cluster is full of long-running tasks except for 1 slot, and that our waiting
job doesn't have local data on this slot. If we count missed scheduling
opportunities and wait until we've seen as many as the total number of nodes,
then we'll wait for (numNodes * heartbeatInterval) seconds, which is a very
long time. On the other hand, setting the threshold to something smaller won't
work when the cluster is mostly idle, because opportunities then arrive so
quickly that the threshold is reached almost at once. The problem is that the
number of scheduling opportunities you get per second depends on the nature of
tasks running in the cluster.
Therefore, I'm going to switch this patch back to counting time so that we have
control over the amount of waiting. There will be a single call to
System.currentTimeMillis at the start of the scheduler's assignTasks method.
Owen, does one call per assignTasks have any performance impact in your
experience? I imagine that even logging data through log4j causes gettimeofday
to be invoked.
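
As a rough illustration of the arithmetic above, here is a minimal sketch. It
is not code from the patch: the class and method names are hypothetical, the
100-node cluster and 3-second heartbeat are made-up numbers, and it assumes
each node with a free slot offers one scheduling opportunity per heartbeat.

```java
// Hypothetical helper, not part of the Fair Scheduler.
class WaitEstimate {
    /**
     * Seconds needed to observe `threshold` missed scheduling opportunities,
     * assuming every node with a free slot yields one opportunity per
     * heartbeat interval.
     */
    static double secondsToReachThreshold(int threshold,
                                          int nodesWithFreeSlots,
                                          double heartbeatIntervalSec) {
        return threshold * heartbeatIntervalSec / nodesWithFreeSlots;
    }

    public static void main(String[] args) {
        int numNodes = 100;          // illustrative cluster size
        double heartbeatSec = 3.0;   // illustrative heartbeat interval

        // Cluster full except for one slot: only one node offers opportunities.
        System.out.println(secondsToReachThreshold(numNodes, 1, heartbeatSec));

        // Mostly idle cluster: every node offers an opportunity each heartbeat.
        System.out.println(secondsToReachThreshold(numNodes, numNodes, heartbeatSec));
    }
}
```

Under these assumptions the same threshold gives a wait that differs by a
factor of numNodes between the two cases, which is why a time-based threshold
is easier to control.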
> Global scheduling in the Fair Scheduler
> ---------------------------------------
>
> Key: MAPREDUCE-548
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-548
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Matei Zaharia
> Attachments: fs-global-v0.patch, hadoop-4667-v1.patch,
> hadoop-4667-v1b.patch, hadoop-4667-v2.patch, HADOOP-4667_api.patch
>
>
> The current schedulers in Hadoop all examine a single job on every heartbeat
> when choosing which tasks to assign, choosing the job based on FIFO or fair
> sharing. There are inherent limitations to this approach. For example, if the
> job at the front of the queue is small (e.g. 10 maps, in a cluster of 100
> nodes), then on average it will launch only one local map on the first 10
> heartbeats while it is at the head of the queue. This leads to very poor
> locality for small jobs. Instead, we need a more "global" view of scheduling
> that can look at multiple jobs. To resolve the locality problem, we will use
> the following algorithm:
> - If the job at the head of the queue has no node-local task to launch, skip
> it and look through other jobs.
> - If a job has waited at least T1 seconds while being skipped, also allow it
> to launch rack-local tasks.
> - If a job has waited at least T2 > T1 seconds, also allow it to launch
> off-rack tasks.
> This algorithm improves locality while bounding the delay that any job
> experiences in launching a task.
> It turns out that whether waiting is useful depends on how many tasks are
> left in the job (which determines the probability of getting a heartbeat
> from a node with a local task) and on whether the job is CPU or IO bound.
> Thus there may be logic for removing the wait on the last few tasks in the
> job.
> As a related issue, once we allow global scheduling, we can launch multiple
> tasks per heartbeat, as in HADOOP-3136. The initial implementation of
> HADOOP-3136 adversely affected performance because it only launched multiple
> tasks from the same job, but with the wait rule above, we will only do this
> for jobs that are allowed to launch non-local tasks.
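
The T1/T2 wait rule in the description can be sketched as below. This is a
minimal illustration, not the patch itself: the class, enum, and method names
are hypothetical, and resetting the wait when a task launches is an
assumption that the description does not state. The current time is passed in
so that a caller can read the clock once per scheduling pass.

```java
// Hypothetical per-job bookkeeping for the T1/T2 locality wait.
class LocalityWait {
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    private final long t1Millis;        // wait before rack-local tasks are allowed
    private final long t2Millis;        // wait before off-rack tasks are allowed (T2 > T1)
    private long waitStartMillis = -1;  // -1 means the job is not currently waiting

    LocalityWait(long t1Millis, long t2Millis) {
        if (t2Millis <= t1Millis) {
            throw new IllegalArgumentException("T2 must be greater than T1");
        }
        this.t1Millis = t1Millis;
        this.t2Millis = t2Millis;
    }

    /** Record that the job was skipped because it had no node-local task. */
    void skipped(long nowMillis) {
        if (waitStartMillis < 0) {
            waitStartMillis = nowMillis;
        }
    }

    /** Assumption: launching any task ends the current wait. */
    void launched() {
        waitStartMillis = -1;
    }

    /** Least local level the job may launch at, given how long it has waited. */
    Locality allowedLevel(long nowMillis) {
        if (waitStartMillis < 0) {
            return Locality.NODE_LOCAL;
        }
        long waited = nowMillis - waitStartMillis;
        if (waited >= t2Millis) {
            return Locality.OFF_RACK;
        }
        if (waited >= t1Millis) {
            return Locality.RACK_LOCAL;
        }
        return Locality.NODE_LOCAL;
    }

    public static void main(String[] args) {
        LocalityWait wait = new LocalityWait(5000, 10000); // illustrative T1, T2
        long now = System.currentTimeMillis();             // one clock read per pass
        wait.skipped(now);
        System.out.println(wait.allowedLevel(now + 6000));
    }
}
```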