[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856712#action_12856712
 ] 

Hong Tang commented on MAPREDUCE-1698:
--------------------------------------

The current InputSplit (o.a.h.mapreduce) javadoc reads as follows:
{noformat}
  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract 
    String[] getLocations() throws IOException, InterruptedException;
{noformat}

This brings a few questions in mind:
# Is null allowed (the answer seems to be NO from existing implementations).
# How do we specify the case where a task can run on any nodes? (To return an 
empty array.)
# What if a split includes blocks from different hosts?

The third question is particularly troublesome. In the past:
- MultiFileSplit (for MultiFileInputFormat) adds all hosts for the first block 
of every file in the split. 
- CombineFileSplit (for CombineFileInputFormat) combines blocks on the same 
rack in one split, and it adds all hosts it knows about in that rack to the 
list of hosts returned from getLocations().

In both cases, it is quite misleading to JobTracker when it tries to schedule 
tasks based on host locality. Worse, it increases the scheduling overhead when 
a job processes a large number of small files and thus every split returns a 
long list of hosts.

> InputSplit.getLocations() semantics is not clear.
> -------------------------------------------------
>
>                 Key: MAPREDUCE-1698
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1698
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Hong Tang
>
> The semantics of InputSplit.getLocations() is not clear and this has led to 
> sub optimal implementations and even bugs (PIG-878) in the past.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to