[ https://issues.apache.org/jira/browse/MAPREDUCE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856713#action_12856713 ]
Hong Tang commented on MAPREDUCE-1698:
--------------------------------------

I think we should enforce that InputSplit.getLocations() only includes hosts that contain at least a certain percentage of the split's data locally (and/or host-local plus rack-local). Ideally, I'd like to see an interface as follows:

{noformat}
public static class LocationInfo {
  String host;
  float hostLocalDataPercent;
  float rackLocalDataPercent;
}

LocationInfo[] getLocations();
{noformat}

We can then introduce two config parameters, "mapreduce.split.min-host-local-data-percent" and "mapreduce.split.min-host-rack-data-percent". A host will be included iff hostLocalDataPercent >= %mapreduce.split.min-host-local-data-percent AND (hostLocalDataPercent + rackLocalDataPercent) >= %mapreduce.split.min-host-rack-data-percent.

There are two benefits of doing so:
- The number of hosts for each split should be much more manageable (translating to lower memory consumption in the JT and simpler scheduling). When most of the data is rack local, we do not need to list all hosts on the rack.
- The JobTracker can quantitatively distinguish cases like (40% host local + 40% rack local) vs. (80% host local).

> InputSplit.getLocations() semantics is not clear.
> -------------------------------------------------
>
>                 Key: MAPREDUCE-1698
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1698
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Hong Tang
>
> The semantics of InputSplit.getLocations() is not clear and this has led to
> suboptimal implementations and even bugs (PIG-878) in the past.
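
To make the proposed inclusion rule concrete, here is a minimal sketch of the two-threshold check in plain Java. It is an illustration only, not Hadoop code: the class SplitLocationFilter, its filter() method, and the LocationInfo constructor are hypothetical names, and it assumes both percentages are expressed on a 0-100 scale, which the comment does not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
class SplitLocationFilter {

    // Mirrors the LocationInfo fields proposed in the comment.
    // The constructor is an addition for convenience.
    static class LocationInfo {
        final String host;
        final float hostLocalDataPercent;  // assumed scale: 0-100
        final float rackLocalDataPercent;  // assumed scale: 0-100

        LocationInfo(String host, float hostLocalDataPercent, float rackLocalDataPercent) {
            this.host = host;
            this.hostLocalDataPercent = hostLocalDataPercent;
            this.rackLocalDataPercent = rackLocalDataPercent;
        }
    }

    // Keeps a candidate only if it passes BOTH thresholds:
    //   hostLocal >= minHostLocalPercent
    //   hostLocal + rackLocal >= minHostRackPercent
    // The two arguments stand in for the values of the proposed
    // "mapreduce.split.min-host-local-data-percent" and
    // "mapreduce.split.min-host-rack-data-percent" parameters.
    static List<LocationInfo> filter(List<LocationInfo> candidates,
                                     float minHostLocalPercent,
                                     float minHostRackPercent) {
        List<LocationInfo> kept = new ArrayList<>();
        for (LocationInfo c : candidates) {
            boolean hostOk = c.hostLocalDataPercent >= minHostLocalPercent;
            boolean hostPlusRackOk =
                c.hostLocalDataPercent + c.rackLocalDataPercent >= minHostRackPercent;
            if (hostOk && hostPlusRackOk) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

With thresholds of 30 and 70, for example, a host at (80 host local, 10 rack local) and a host at (40, 40) would both be kept, while hosts at (10, 85) and (40, 20) would be dropped. In a real job the two thresholds would presumably be read from the job configuration rather than passed in directly.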