[jira] Commented: (HADOOP-3293) When an input split spans cross block boundary, the split location should be the host having most of bytes on it.

Jothi Padmanabhan (JIRA) Thu, 30 Oct 2008 08:17:19 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643989#action_12643989
 ]


Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

Since blkIndex is used only to identify the hosts,
{code}
int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining,
                             splitSize);
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                         blkLocations[blkIndex].getHosts()));
{code}

we could instead modify getBlockIndex() to return the list of hosts that hold
the most data for that split. For example, if the split covered
Block 1   80 bytes   Hosts A,B,C
Block 2  100 bytes   Hosts A,D,E
Block 3   70 bytes   Hosts D,F,B

We would compute each host's contribution to the split as
A  180
B  150
C   80
D  170
E  100
F   70

We could then return the hosts with the largest contributions: A, D, B.
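
A minimal sketch of the ranking described above, not the actual patch. The class SplitHostRanking, the holder BlockPortion and the method topHosts() are hypothetical names used only for illustration; in FileInputFormat the per-block hosts would come from blkLocations[i].getHosts(), and the byte count would be the part of each block that falls inside the split. Ties are broken by first-seen order, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitHostRanking {

    /** Hypothetical holder: the bytes one block contributes to a split, and the hosts storing that block. */
    static final class BlockPortion {
        final long bytesInSplit;
        final String[] hosts;

        BlockPortion(long bytesInSplit, String... hosts) {
            this.bytesInSplit = bytesInSplit;
            this.hosts = hosts;
        }
    }

    /** Returns up to maxHosts host names, ordered by how many bytes of the split each one holds. */
    static List<String> topHosts(List<BlockPortion> portions, int maxHosts) {
        // Sum the contribution of every block to each host that stores it.
        // LinkedHashMap keeps first-seen order, so the stable sort below breaks ties deterministically.
        Map<String, Long> bytesPerHost = new LinkedHashMap<>();
        for (BlockPortion p : portions) {
            for (String host : p.hosts) {
                bytesPerHost.merge(host, p.bytesInSplit, Long::sum);
            }
        }

        // Rank hosts by descending byte count.
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(bytesPerHost.entrySet());
        ranked.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        // Keep only the first maxHosts entries.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : ranked) {
            if (result.size() == maxHosts) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // The example from this comment.
        List<BlockPortion> portions = Arrays.asList(
            new BlockPortion(80, "A", "B", "C"),
            new BlockPortion(100, "A", "D", "E"),
            new BlockPortion(70, "D", "F", "B"));
        System.out.println(topHosts(portions, 3)); // prints [A, D, B]
    }
}
```

How many hosts to return (three here) is a free choice in this sketch; matching the replication factor seems natural, since FileSplit would otherwise have received the hosts of a single block.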


> When an input split spans cross block boundary, the split location should be 
> the host having most of bytes on it. 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
