[ http://issues.apache.org/jira/browse/HADOOP-692?page=comments#action_12448827 ] Hairong Kuang commented on HADOOP-692: --------------------------------------
Is there any way that we could find out distance (#hops) between two nodes at runtime? Traceout prints all the hops but it has to run at one of the node. Once the namenode could find out the distance between any two nodes, it can find out which nodes belong to the same rack (nodes has a distance of 1 belong to a same rack). It can also get the distance between any two racks. Once the name node has this distance info, it can perform the following rack-aware optimizations: 1. replica placement policy It's similar to what Konstantin proposed but with a slight change. - place the first replica in a node closest to the writer. - place the 2nd replica on a different node on the same rack as the 1st node. - place the 3rd replica on a different rack with a distance of 2 to the first node. - place the rest of the replicas on a random not-yet-selected one. 2. orgnize the write pipeline in the ascending order of distance to the writer. 3. for reading, the best node is defined as the node closiest to the reader > Rack-aware Replica Placement > ---------------------------- > > Key: HADOOP-692 > URL: http://issues.apache.org/jira/browse/HADOOP-692 > Project: Hadoop > Issue Type: Improvement > Components: dfs > Affects Versions: 0.8.0 > Reporter: Hairong Kuang > Assigned To: Hairong Kuang > Fix For: 0.9.0 > > > This issue assumes that HDFS runs on a cluster of computers that spread > across many racks. Communication between two nodes on different racks needs > to go through switches. Bandwidth in/out of a rack may be less than the total > bandwidth of machines in the rack. The purpose of rack-aware replica > placement is to improve data reliability, availability, and network bandwidth > utilization. The basic idea is that each data node determines to which rack > it belongs at the startup time and notifies the name node of the rack id upon > registration. The name node maintains a rackid-to-datanode map and tries to > place replicas across racks. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira