[ 
https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643792#action_12643792
 ] 

Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

The reason for this is that  FileInputFormat.getBlockIndex() returns the 
blockindex of the starting block for the given offset. Instead, it should 
identify all the blocks that this particular split spans and then choose the 
block that contributes the maximum data for this split.  

We could use the following approach

{code}
//Calculate the number of blocks the split spans
if (numBlocks == 1)
  return startIndex;
else if (numBlocks == 2)
  return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex:startIndex+1;
else 
  return startIndex + 1;
{code}

The rationale here is that if there are more than two blocks, we are guaranteed 
that block 2 is contributing its entire block length for this split.

Note that we cannot do the identification of the block index based on the 
amount of data contributed by the individual host, because of the replication 
factor.
For example, consider the following example (assume dfs block size = 100)
Block 1 contributes 20 bytes and its hosts are A,B,C
Block 2 contributes 100 bytes and its hosts are A, D,E
Block 3 contributes 10 bytes and its hosts are  D,E,F

If we aggregate on a per host basis, host A having contributed 120 bytes would 
be the ideal choice. However, if we choose Block 1 as the index to be returned, 
even hosts B &C would be treated as data local, which is sub optimal.  
Thoughts?

> When an input split spans cross block boundary, the split location should be 
> the host having most of bytes on it. 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to