[
https://issues.apache.org/jira/browse/HADOOP-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618476#action_12618476
]
Joydeep Sen Sarma commented on HADOOP-3136:
-------------------------------------------
hey folks - there are some downsides to what's proposed here:
- it's likely that on an idle cluster with a series of new jobs, each new job
would end up on a subset of nodes (the first nodes to send heartbeats after job
arrival). this is, generally speaking, bad:
  - jobs have different characteristics and it's good to spread them around (esp.
pertinent to memory usage)
  - reduce tasks have bursty network/disk traffic patterns and in general
scheduling multiple of them on the same node is bad
  - map locality could get worse (would we schedule non-local tasks even if a node
advertises availability of multiple slots? this is a very hard problem to solve)
the idea of dispatching multiple scheduling decisions in one shot is cool. but
making multiple scheduling decisions while considering only the availability of
a single node is not that great. the current scheduler limps along with this
badness since it only makes one decision at a time (so what's locally optimal
is usually globally optimal - esp. as far as map locality is concerned). but
that would increasingly not be the case when multiple decisions need to be made
in one shot.
here's a slightly different - but even easier - way of solving this that avoids
these downsides:
- make the heartbeat rate from the TT proportional to the number of idle slots
on the TT (equivalently, make the heartbeat interval inversely proportional).
- the exact formula is approx.: send I/S heartbeats per second - i.e. one every
S/I seconds - on behalf of the idle slots from each TT (where I is the number of
idle slots and S is the average task completion time, either parameterized or
observed).
The reasoning is as follows:
- in a busy cluster (lots of tasks in and out), there is no problem with
delays in dispatching tasks. And we know that scheduling decisions are
reasonably satisfactory today (or at least that is an issue outside the
scope of this Jira).
- in a busy cluster - if the average task completion time is S and the total
number of slots per TT is N - then on average each TT sends N/S heartbeats per
second (one every S/N seconds)
- so the idea is to make an idle cluster mimic a busy cluster as far as TT/JT
communication goes. We do this by pretending that the idle slots are actually
busy - and thereby sending heartbeats every S/I seconds on behalf of the idle
slots.
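to make this concrete - a minimal sketch of the pacing logic (hypothetical
class/method names, not actual TaskTracker code; S is assumed to be available
either as a config knob or an observed running average):
{code:java}
// sketch only - hypothetical names, not actual TaskTracker code.
// S = average task completion time in seconds (parameterized or observed);
// I = number of currently idle slots on this TT. each idle slot pretends
// to be a busy slot that completes a task every S seconds, so the TT
// sends one extra heartbeat every S/I seconds.
public class IdleSlotHeartbeatPacer {

  private final double avgTaskCompletionSecs; // S

  public IdleSlotHeartbeatPacer(double avgTaskCompletionSecs) {
    this.avgTaskCompletionSecs = avgTaskCompletionSecs;
  }

  /** milliseconds until the next idle-slot heartbeat for this TT */
  public long nextHeartbeatDelayMs(int idleSlots) {
    if (idleSlots <= 0) {
      // fully busy TT: no extra heartbeats needed - completion-driven
      // heartbeats already arrive at roughly N/S per second
      return Long.MAX_VALUE;
    }
    // I idle slots => I/S heartbeats per second => one every S/I seconds
    return Math.round(avgTaskCompletionSecs * 1000.0 / idleSlots);
  }
}
{code}
e.g. with S = 60s, a TT with 4 idle slots would heartbeat every 15 seconds, and
a fully idle 8-slot TT every 7.5 seconds - exactly the rate its slots would
generate if they were all busy.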
there should not be any impact on JT scalability or overall network traffic.
the network and JT should be able to bear the traffic that would result from the
cluster being 100% busy (and that's what we are emulating here). in addition -
where we are adding traffic (when the cluster is idling) - other (DFS/shuffle)
traffic would be lower.
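a quick sanity check on the load claim, with made-up numbers (1000 TTs, 8 slots
each, S = 60s; all hypothetical):
{code:java}
// back-of-envelope check with made-up numbers: the aggregate heartbeat
// rate at the JT is N/S per TT regardless of utilization, since busy
// slots drive ~B/S hb/s, idle slots are paced at I/S hb/s, and B + I = N.
public class HeartbeatLoadCheck {
  public static void main(String[] args) {
    int trackers = 1000;       // hypothetical cluster size
    int slotsPerTT = 8;        // N
    double avgTaskSecs = 60.0; // S
    for (int idle : new int[] {0, 4, 8}) {
      int busy = slotsPerTT - idle;
      double perTT = busy / avgTaskSecs + idle / avgTaskSecs;
      System.out.printf("idle slots=%d: %.1f heartbeats/sec cluster-wide%n",
          idle, trackers * perTT);
    }
  }
}
{code}
this prints ~133.3 heartbeats/sec in all three cases - the JT sees the same
load whether the cluster is busy, half idle, or fully idle.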
longer term - we should try to separate scheduling (which should consider
global resource availability) from dispatching (which is a per-node activity).
But this seems like a much bigger problem.
> Assign multiple tasks per TaskTracker heartbeat
> -----------------------------------------------
>
> Key: HADOOP-3136
> URL: https://issues.apache.org/jira/browse/HADOOP-3136
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Arun C Murthy
> Fix For: 0.19.0
>
>
> In today's logic of finding a new task, we assign only one task per heartbeat.
> We probably could give the tasktracker multiple tasks subject to the max
> number of free slots it has - for maps we could assign it data local tasks.
> We could probably run some logic to decide what to give it if we run out of
> data local tasks (e.g., tasks from overloaded racks, tasks that have least
> locality, etc.). In addition to maps, if it has reduce slots free, we could
> give it reduce task(s) as well. Again for reduces we could probably run some
> logic to give more tasks to nodes that are closer to nodes running most maps
> (assuming data generated is proportional to the number of maps). For example,
> if rack1 has 70% of the input splits, and we know that most maps are data/rack
> local, we try to schedule ~70% of the reducers there.
> Thoughts?
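for reference, the multi-assignment idea in the description above boils down to
a loop like the following (a sketch with hypothetical stand-in types - not the
actual JobTracker/TaskTracker API):
{code:java}
import java.util.ArrayList;
import java.util.List;

// hypothetical stand-in types - not the real mapred classes
interface Task {}

interface TrackerStatus {
  String host();
  int freeMapSlots();
  int freeReduceSlots();
}

interface JobQueue {
  Task pollDataLocalMap(String host); // map with a split on this host, or null
  Task pollFallbackMap();             // e.g. overloaded-rack / least-locality map
  Task pollReduce();                  // pending reduce, or null
}

class MultiAssignSketch {
  // fill every free slot in one heartbeat: data-local maps first,
  // then fallback maps, then reduces
  List<Task> assignTasks(TrackerStatus tt, JobQueue jobs) {
    List<Task> assigned = new ArrayList<Task>();
    for (int i = 0; i < tt.freeMapSlots(); i++) {
      Task t = jobs.pollDataLocalMap(tt.host());
      if (t == null) t = jobs.pollFallbackMap();
      if (t == null) break;
      assigned.add(t);
    }
    for (int i = 0; i < tt.freeReduceSlots(); i++) {
      Task t = jobs.pollReduce();
      if (t == null) break;
      assigned.add(t);
    }
    return assigned;
  }
}
{code}
note that this is exactly the per-node greedy pattern the comment above warns
about: each poll sees only this TT's free slots, not global resource
availability.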