[jira] Commented: (HADOOP-3293) When an input split spans cross block boundary, the split location should be the host having most of bytes on it.

Jothi Padmanabhan (JIRA) Thu, 30 Oct 2008 23:29:39 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644224#action_12644224
 ]


Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

Runping's point on aggregating bytes over racks to determine rack locality 
makes sense.
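
To make the aggregation concrete, here is a minimal sketch of summing a split's bytes per host and per rack. Everything in it is hypothetical and is not Hadoop code: the SplitLocalitySketch class, the Block type, the host-to-rack map, and the "/unknown-rack" placeholder are illustrative stand-ins. It also assumes a block should count once per rack even when several replicas sit on that rack.

```java
import java.util.*;

// Hypothetical sketch: none of these types are Hadoop classes.
class SplitLocalitySketch {
    /** One block of a file: its byte range and the hosts holding a replica. */
    static final class Block {
        final long offset, length;
        final String[] hosts;
        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Number of bytes of block b that fall inside [start, start + len). */
    static long overlap(long start, long len, Block b) {
        long lo = Math.max(start, b.offset);
        long hi = Math.min(start + len, b.offset + b.length);
        return Math.max(0L, hi - lo);
    }

    /** Bytes of the split that each host could serve locally. */
    static Map<String, Long> bytesPerHost(long start, long len, List<Block> blocks) {
        Map<String, Long> total = new TreeMap<>();
        for (Block b : blocks) {
            long n = overlap(start, len, b);
            if (n == 0) continue;
            for (String h : new TreeSet<>(Arrays.asList(b.hosts))) {
                total.merge(h, n, Long::sum);
            }
        }
        return total;
    }

    /** Bytes of the split available on each rack (assumption: a block counts once per rack). */
    static Map<String, Long> bytesPerRack(long start, long len, List<Block> blocks,
                                          Map<String, String> hostToRack) {
        Map<String, Long> total = new TreeMap<>();
        for (Block b : blocks) {
            long n = overlap(start, len, b);
            if (n == 0) continue;
            Set<String> racks = new TreeSet<>();
            for (String h : b.hosts) {
                racks.add(hostToRack.getOrDefault(h, "/unknown-rack")); // placeholder name
            }
            for (String r : racks) {
                total.merge(r, n, Long::sum);
            }
        }
        return total;
    }

    /** Key with the largest byte count; ties go to the smaller key; null if the map is empty. */
    static String best(Map<String, Long> bytes) {
        String bestKey = null;
        long bestVal = -1;
        for (Map.Entry<String, Long> e : new TreeMap<>(bytes).entrySet()) {
            if (e.getValue() > bestVal) {
                bestKey = e.getKey();
                bestVal = e.getValue();
            }
        }
        return bestKey;
    }

    public static void main(String[] args) {
        // Two 100-byte blocks; the split covers bytes [60, 160).
        List<Block> blocks = Arrays.asList(
            new Block(0, 100, "h1", "h2"),
            new Block(100, 100, "h3"));
        Map<String, String> hostToRack = new HashMap<>();
        hostToRack.put("h1", "/rackA");
        hostToRack.put("h2", "/rackB");
        hostToRack.put("h3", "/rackA");

        System.out.println(bytesPerHost(60, 100, blocks));
        System.out.println(bytesPerRack(60, 100, blocks, hostToRack));
    }
}
```

In this example no single host holds the whole split, but one rack does, which is the case where per-rack aggregation gives a different answer from per-host aggregation.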

The problem is that the JobClient is unaware of the topology. Some ways to 
build the topology awareness are:
# Make the JobClient query the topology service and build its own topology 
awareness. The problem with this approach is that we would need to ensure that 
the topology scripts used by the JobClient and the JobTracker are always in 
sync. 
# Let the client get rack information back along with the hosts when it 
queries the FS for block locations (fs.getFileBlockLocations). We could add a 
new method, fs.getResolvedFileBlockLocations, that returns the hosts with the 
rack information. The default implementation would just return the hosts; DFS 
would override this method to return the rack information along with the 
hosts. We are guaranteed correct topology information because the Namenode and 
JobTracker would be using the same topology information.
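
A rough sketch of what the second approach could look like follows. It only illustrates the default-versus-override idea and is not the actual FileSystem API: SketchFileSystem, SketchDfs, the String[][] return type, the "/rack/host" path format, and the "/default-rack" fallback are all assumptions made for this example.

```java
import java.util.*;

// Hypothetical stand-ins; none of these types are Hadoop classes.
abstract class SketchFileSystem {
    /** Hosts holding each block that overlaps [start, start + len). */
    abstract String[][] getFileBlockLocations(String path, long start, long len);

    /** Default: no topology knowledge, so just return the bare host names. */
    String[][] getResolvedFileBlockLocations(String path, long start, long len) {
        return getFileBlockLocations(path, start, len);
    }
}

/** A DFS-like file system that knows which rack each host is on. */
class SketchDfs extends SketchFileSystem {
    private final long blockSize;
    private final String[][] blockHosts;          // hosts per block of one file
    private final Map<String, String> hostToRack; // stands in for the Namenode's topology

    SketchDfs(long blockSize, String[][] blockHosts, Map<String, String> hostToRack) {
        this.blockSize = blockSize;
        this.blockHosts = blockHosts;
        this.hostToRack = hostToRack;
    }

    @Override
    String[][] getFileBlockLocations(String path, long start, long len) {
        // The path is ignored: this sketch models a single file.
        if (len <= 0 || start < 0) return new String[0][];
        int first = (int) (start / blockSize);
        int last = (int) Math.min((start + len - 1) / blockSize, blockHosts.length - 1);
        if (first > last) return new String[0][];
        return Arrays.copyOfRange(blockHosts, first, last + 1);
    }

    /** Override: prefix every host with its rack, e.g. "/rackA/h1". */
    @Override
    String[][] getResolvedFileBlockLocations(String path, long start, long len) {
        String[][] hosts = getFileBlockLocations(path, start, len);
        String[][] resolved = new String[hosts.length][];
        for (int i = 0; i < hosts.length; i++) {
            resolved[i] = new String[hosts[i].length];
            for (int j = 0; j < hosts[i].length; j++) {
                String rack = hostToRack.getOrDefault(hosts[i][j], "/default-rack");
                resolved[i][j] = rack + "/" + hosts[i][j];
            }
        }
        return resolved;
    }
}
```

With something of this shape, the JobClient would never run a topology script itself. It would read the rack from whatever the file system returns, and file systems without topology information would keep working through the default implementation.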

The second approach looks better. Thoughts?



> When an input split spans cross block boundary, the split location should be 
> the host having most of bytes on it. 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
