[jira] Commented: (HADOOP-2014) Job Tracker should prefer input-splits from overloaded racks

eric baldeschwieler (JIRA) Thu, 14 Feb 2008 11:11:29 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569048#action_12569048 ]


eric baldeschwieler commented on HADOOP-2014:
---------------------------------------------

At some point we'll need to put a lot more work into this!  How to trade off 
rack vs local vs size is very interesting.

That said, when you look at our current execution profiles, we have long tails 
because maps are not running locally.  Anything we can do to reduce that should 
lead to speedups.

We should run some experiments.  We should not neglect the one rack case, where 
size should probably dominate.  Maybe we can express rack load as the 
probability that a block will execute locally and then weight that by the cost 
to ship bytes?  This varies of course based on your network...  We could choose 
some constants for on and off rack to get started.
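
One way to read the weighting idea above, as a rough sketch only: treat the cost of a split as its size times the expected per-byte shipping cost, given the probabilities that it runs on the node holding the block, elsewhere on the same rack, or off rack. The function name, the three-way probability breakdown, and the two constants are assumptions made for illustration; the constants are placeholders, not measurements.

```python
# Hypothetical relative costs of shipping one byte; placeholder values only.
ON_RACK_COST_PER_BYTE = 1.0
OFF_RACK_COST_PER_BYTE = 4.0


def expected_shipping_cost(split_bytes, p_node_local, p_rack_local):
    """Expected cost of moving a split's bytes to the node that runs it.

    p_node_local: probability the map runs on a node holding the block
                  (no bytes shipped).
    p_rack_local: probability it runs on another node in the same rack.
    The remaining probability mass is treated as off-rack execution.
    """
    if p_node_local < 0.0 or p_rack_local < 0.0:
        raise ValueError("probabilities must be non-negative")
    if p_node_local + p_rack_local > 1.0:
        raise ValueError("probabilities must sum to at most 1")
    # Guard against tiny negative values from floating-point rounding.
    p_off_rack = max(0.0, 1.0 - p_node_local - p_rack_local)
    per_byte = (p_rack_local * ON_RACK_COST_PER_BYTE
                + p_off_rack * OFF_RACK_COST_PER_BYTE)
    return split_bytes * per_byte
```

In the one-rack case the off-rack probability is zero, so the estimate reduces to size times the on-rack term and size dominates, as suggested above.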

> Job Tracker should prefer input-splits from overloaded racks
> ------------------------------------------------------------
>
>                 Key: HADOOP-2014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2014
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Devaraj Das
>
> Currently, when the Job Tracker assigns a mapper task to a task tracker and
> there is no split local to the task tracker, the
> job tracker will find the first runnable task in the mast task list and
> assign the task to the task tracker.
> The split for the task is not local to the task tracker, of course. However, 
> the split may be local to other task trackers.
> Assigning that task to that task tracker may decrease the potential
> number of mapper attempts with data locality.
> The desired behavior in this situation is to choose a task whose split is not
> local to any task tracker.
> Resort to the current behavior only if no such task is found.
> In general, it will be useful to know the number of task trackers to which 
> each split is local.
> To assign a task to a task tracker, the job tracker should first try to pick
> a task that is local to the task tracker and that has the minimal number of
> task trackers to which it is local. If no task is local to the task tracker,
> the job tracker should try to pick a task that has the minimal number of task
> trackers to which it is local.
> It is worthwhile to instrument the job tracker code to report the number of 
> splits that are local to some task trackers.
> That should be the maximum number of tasks with data locality. By comparing
> that number with the actual number of
> data local mappers launched, we can know the effectiveness of the job tracker
> scheduling.
> When we introduce rack locality, we should apply the same principle.
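
The selection rule and the locality metric described in the issue can be sketched roughly as follows. This is an illustration under stated assumptions, not the job tracker's actual code: the function names, the use of plain task ids and tracker names, and the `split_hosts` mapping (task id to the set of task trackers holding its split) are all hypothetical.

```python
def pick_task(tracker, pending, split_hosts):
    """Pick a pending task for `tracker` following the rule in the issue.

    Prefer tasks whose split is local to `tracker`; among the candidates,
    take the one whose split is local to the fewest task trackers. If no
    task is local, apply the same minimum over all pending tasks, so a
    split that is local to no tracker at all is chosen first.
    """
    def locality_count(task):
        return len(split_hosts.get(task, ()))

    pending = list(pending)
    local = [t for t in pending if tracker in split_hosts.get(t, ())]
    candidates = local if local else pending
    if not candidates:
        return None
    # min() keeps the first of any ties, preserving the pending-list order.
    return min(candidates, key=locality_count)


def max_data_local_tasks(split_hosts, live_trackers):
    """Count splits local to at least one live task tracker.

    Per the issue, this is the upper bound on data-local map tasks.
    """
    return sum(1 for hosts in split_hosts.values() if hosts & live_trackers)


def locality_effectiveness(data_local_launched, split_hosts, live_trackers):
    """Ratio of data-local mappers launched to the upper bound, or None."""
    upper = max_data_local_tasks(split_hosts, live_trackers)
    return data_local_launched / upper if upper else None
```

For example, with splits t1 on {A, B}, t2 on {A}, t3 on no tracker, and t4 on {B, C, D}, tracker A would be given t2 (local, fewest replicas), while a tracker holding none of the splits would be given t3.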

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
