[ https://issues.apache.org/jira/browse/HADOOP-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644224#action_12644224 ]
Jothi Padmanabhan commented on HADOOP-3293:
-------------------------------------------

Runping's point about aggregating bytes over racks to determine rack locality makes sense. The problem is that the JobClient is unaware of the topology. Some ways to build topology awareness are:
# Make the JobClient query the topology service and build its own view of the topology. The problem with this approach is that we need to ensure that the topology scripts used by the JobClient and the JobTracker are always in sync.
# Let the client get back rack information along with the hosts when it queries the FS for block locations (fs.getFileBlockLocations). We could add a new method, fs.getResolvedFileBlockLocations, that returns the hosts with the rack information. The default implementation would just return the hosts; DFS would override this method and return the rack information along with the hosts. We are guaranteed correct topology information, as the Namenode and JobTracker would be using the same topology information.

The second approach looks better. Thoughts?

> When an input split spans cross block boundary, the split location should be
> the host having most of bytes on it.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3293
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3293
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Jothi Padmanabhan
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
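
To illustrate the rack aggregation discussed in the comment above, here is a minimal Java sketch. It assumes that the second approach hands the client each replica location as a "/rack/host" path. Every name in it (RackLocalitySketch, BlockPart, bytesPerRack, bestKey, the "/default-rack" fallback) is hypothetical and chosen for illustration only. It is not the Hadoop API and not the patch for this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: none of these types are Hadoop classes.
final class RackLocalitySketch {

    /** The part of one block that falls inside the split, with its replica locations. */
    static final class BlockPart {
        final long bytesInSplit;
        final List<String> topologyPaths; // assumed form: "/rack/host"

        BlockPart(long bytesInSplit, String... topologyPaths) {
            this.bytesInSplit = bytesInSplit;
            this.topologyPaths = Arrays.asList(topologyPaths);
        }
    }

    /** Returns the rack portion of "/rack/host"; falls back to a placeholder rack. */
    static String rackOf(String topologyPath) {
        int idx = topologyPath.lastIndexOf('/');
        return idx <= 0 ? "/default-rack" : topologyPath.substring(0, idx);
    }

    /** Sums the split's bytes per rack, counting a block once per rack that holds a replica. */
    static Map<String, Long> bytesPerRack(List<BlockPart> parts) {
        Map<String, Long> totals = new TreeMap<>();
        for (BlockPart part : parts) {
            Set<String> racks = new HashSet<>();
            for (String path : part.topologyPaths) {
                racks.add(rackOf(path));
            }
            for (String rack : racks) {
                totals.merge(rack, part.bytesInSplit, Long::sum);
            }
        }
        return totals;
    }

    /** Sums the split's bytes per "/rack/host" location. */
    static Map<String, Long> bytesPerHost(List<BlockPart> parts) {
        Map<String, Long> totals = new TreeMap<>();
        for (BlockPart part : parts) {
            for (String path : new HashSet<>(part.topologyPaths)) {
                totals.merge(path, part.bytesInSplit, Long::sum);
            }
        }
        return totals;
    }

    /** Returns the key with the largest total; ties go to the first key in sorted order. */
    static String bestKey(Map<String, Long> totals) {
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A split covering 10 units of one block and 54 units of the next.
        List<BlockPart> parts = Arrays.asList(
            new BlockPart(10, "/rack1/h1", "/rack1/h2", "/rack2/h3"),
            new BlockPart(54, "/rack2/h3", "/rack2/h4", "/rack3/h5"));
        System.out.println("best host: " + bestKey(bytesPerHost(parts)));
        System.out.println("best rack: " + bestKey(bytesPerRack(parts)));
    }
}
```

In this made-up example, /rack2 holds replicas of both blocks, so it accumulates all 64 units, while /rack1 gets 10 and /rack3 gets 54. Counting a block only once per rack is a design choice of this sketch: a second replica on the same rack does not add new data for the split.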