[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856713#action_12856713
 ] 

Hong Tang commented on MAPREDUCE-1698:
--------------------------------------

I think we should enforce that InputSplit.getLocations() only includes hosts 
that contain at least a certain percentage of the split's data locally (and/or 
host-local plus rack-local data combined). 

Ideally, I'd like to see an interface as follows:

{noformat}
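public static class LocationInfo {
  String host;                 // candidate host for the split
  float hostLocalDataPercent;  // percent of the split's data stored on this host
  float rackLocalDataPercent;  // percent stored elsewhere on this host's rack
}

// Would replace the current String[] return type of getLocations().
LocationInfo[] getLocations();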
{noformat}

We can then introduce two config parameters, 
"mapreduce.split.min-host-local-data-percent" and 
"mapreduce.split.min-host-rack-data-percent". A host will be included iff 
hostLocalDataPercent >= %mapreduce.split.min-host-local-data-percent AND 
(hostLocalDataPercent + rackLocalDataPercent) >= 
%mapreduce.split.min-host-rack-data-percent, where %name denotes the 
configured value of that parameter.
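
The inclusion rule above could be sketched roughly as follows. This is only an 
illustration, not an existing Hadoop API: the class SplitLocationFilter, the 
filter() method and the constructor are hypothetical names, and the two 
thresholds are passed in as plain floats rather than read from a Configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hadoop.
final class SplitLocationFilter {

  /** Mirrors the LocationInfo fields proposed in this comment. */
  public static class LocationInfo {
    final String host;
    final float hostLocalDataPercent;
    final float rackLocalDataPercent;

    // Constructor added for the sketch only.
    public LocationInfo(String host, float hostLocalDataPercent,
                        float rackLocalDataPercent) {
      this.host = host;
      this.hostLocalDataPercent = hostLocalDataPercent;
      this.rackLocalDataPercent = rackLocalDataPercent;
    }
  }

  /**
   * Keeps a host iff it meets both thresholds:
   * host-local >= minHostLocalPercent AND
   * host-local + rack-local >= minHostRackPercent.
   * The thresholds stand in for the two proposed config parameters.
   */
  public static List<LocationInfo> filter(List<LocationInfo> candidates,
                                          float minHostLocalPercent,
                                          float minHostRackPercent) {
    List<LocationInfo> kept = new ArrayList<LocationInfo>();
    for (LocationInfo c : candidates) {
      boolean enoughHostLocal =
          c.hostLocalDataPercent >= minHostLocalPercent;
      boolean enoughHostPlusRack =
          c.hostLocalDataPercent + c.rackLocalDataPercent >= minHostRackPercent;
      if (enoughHostLocal && enoughHostPlusRack) {
        kept.add(c);
      }
    }
    return kept;
  }
}
```

With thresholds of, say, 30 and 70, a host holding 5% of the data locally is 
dropped even if 90% of the data is on its rack, while hosts at (80% host local) 
and (40% host local + 40% rack local) are both kept.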

There are two benefits of doing so:
- The number of hosts for each split should be much more manageable 
(translating to lower memory consumption in the JT and simpler scheduling). 
When most of the data is rack local, we do not need to list all hosts on the 
rack.
- The JobTracker can quantitatively distinguish cases like (40% host local + 
40% rack local) vs (80% host local).

> InputSplit.getLocations() semantics is not clear.
> -------------------------------------------------
>
>                 Key: MAPREDUCE-1698
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1698
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Hong Tang
>
> The semantics of InputSplit.getLocations() are not clear, and this has led to 
> suboptimal implementations and even bugs (PIG-878) in the past.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira