[
https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568806#action_12568806
]
Amar Kamat commented on HADOOP-2014:
------------------------------------
Makes sense. I had similar thoughts, though initially not about task ordering
but about task scheduling after a cache miss (at the rack level); see HADOOP-2812.
I opened a new issue since this issue explicitly specifies _rack_.
On similar lines, we can have:
1) machine-load = _f_(machine) = _some_function_of_(num-local-splits),
useful in intra-rack scheduling.
_f_ should give some indication of the expected time to process all the maps, i.e.
_f_(machine) = (num-local-splits - num-processed) * avg-processing-time / _MAX_
with the initial value of avg-processing-time = _MAX_
2) split-load = _s_(split) = _min_(loads of the machines having this split locally)
3) rack-load = _r_(rack) = _max_(loads of the splits local to the rack) / num-map-slots,
useful in inter-rack scheduling.
This gives priority to the rack with the most heavily loaded split.
_avg_ can be used instead of _max_ in (3), which will give priority to the rack
with the highest average load.
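A rough sketch of the three metrics, just to make the definitions concrete.
Everything named here (Machine, MAX_TIME, split_hosts, the combine argument)
is made up for illustration and is not existing JobTracker code. It also
assumes that a split is "local to a rack" when at least one machine holding
it sits in that rack.

```python
from dataclasses import dataclass

# Hypothetical upper bound on per-split processing time (the _MAX_ above).
MAX_TIME = 600.0


@dataclass
class Machine:
    name: str
    rack: str
    num_local_splits: int
    num_processed: int = 0
    # Starts at MAX_TIME, so an untouched machine's load is its split count.
    avg_processing_time: float = MAX_TIME


def machine_load(m):
    """f(machine): remaining local splits scaled by the observed speed."""
    remaining = m.num_local_splits - m.num_processed
    return remaining * m.avg_processing_time / MAX_TIME


def split_load(hosts):
    """s(split): load of the least loaded machine holding the split."""
    return min(machine_load(m) for m in hosts)


def rack_load(rack, split_hosts, num_map_slots, combine=max):
    """r(rack): combined load of the rack's splits per map slot.

    split_hosts maps a split id to the list of machines holding it.
    Pass an averaging function as combine to get the _avg_ variant.
    """
    loads = [
        split_load(hosts)
        for hosts in split_hosts.values()
        if any(m.rack == rack for m in hosts)
    ]
    if not loads:
        return 0.0
    return combine(loads) / num_map_slots
```

The scheduler would then serve the rack with the largest rack_load first.
Keeping combine as a parameter makes it easy to compare the _max_ and _avg_
variants on the same cluster state.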
Would this be a better metric?
Thoughts?
> Job Tracker should prefer input-splits from overloaded racks
> ------------------------------------------------------------
>
> Key: HADOOP-2014
> URL: https://issues.apache.org/jira/browse/HADOOP-2014
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Runping Qi
> Assignee: Devaraj Das
>
> Currently, when the Job Tracker assigns a mapper task to a task tracker and
> there is no local split to the task tracker, the
> job tracker will find the first runnable task in the mast task list and
> assign the task to the task tracker.
> The split for the task is not local to the task tracker, of course. However,
> the split may be local to other task trackers.
> Assigning that task to that task tracker may decrease the potential
> number of mapper attempts with data locality.
> The desired behavior in this situation is to choose a task whose split is not
> local to any task tracker.
> Resort to the current behavior only if no such task is found.
> In general, it will be useful to know the number of task trackers to which
> each split is local.
> To assign a task to a task tracker, the job tracker should first try to pick
> a task that is local to the task tracker and that has the minimal number of
> task trackers to which it is local. If no task is local to the task tracker,
> the job tracker should try to pick a task that has the minimal number of task
> trackers to which it is local.
> It is worthwhile to instrument the job tracker code to report the number of
> splits that are local to some task trackers.
> That should be the maximum number of tasks with data locality. By comparing
> that number with the actual number of
> data-local mappers launched, we can know the effectiveness of the job tracker
> scheduling.
> When we introduce rack locality, we should apply the same principle.
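For reference, the selection rule in the quoted description could be sketched
roughly as below. This is only an illustration: pick_task,
max_data_local_tasks and the pending map are hypothetical names, not
JobTracker APIs, and breaking ties by task id is an added assumption.

```python
def pick_task(tracker, pending):
    """Choose a task for tracker from the pending tasks.

    pending maps a task id to the set of tracker names that hold the
    task's split locally (an empty set means local to no tracker).
    """
    if not pending:
        return None
    # Prefer tasks whose split is local to this tracker.
    local = [task for task, hosts in pending.items() if tracker in hosts]
    # Otherwise consider every pending task.
    candidates = local if local else list(pending)
    # In both cases take the task that is local to the fewest trackers,
    # so non-local picks consume splits nobody could have run locally first.
    return min(candidates, key=lambda task: (len(pending[task]), task))


def max_data_local_tasks(pending):
    """Upper bound on data-local tasks: splits local to at least one tracker."""
    return sum(1 for hosts in pending.values() if hosts)
```

Comparing max_data_local_tasks with the number of data-local mappers actually
launched gives the effectiveness measure mentioned in the description.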
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.