hey doug, thanks for the quick reply. this is an interesting scenario that you bring up: a large hdfs pool across data centers with, perhaps, a jobtracker per data center (or a jobtracker per rack). i'm still not clear how rack-locality helps map input performance here; JobInProgress.findNewTask seems to see the world as local or other, with no rack-awareness. are you referring to the ReplicationTargetChooser, which will always put one of the three replicas (assuming your replication level is 3) on your rack, hence increasing your chances of finding the block within your sub-jobtracker net?
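
to make the question concrete, here's a minimal sketch of the three-way distinction i'd expect a rack-aware scheduler to make, as opposed to the local-or-other view above. this is not hadoop code: LocalitySketch, Node, Locality and classify are all hypothetical names, and the host and rack strings are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from Hadoop.
final class LocalitySketch {

    // Best-to-worst placement of a map task relative to its input block.
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    // A host plus the rack it reports, e.g. new Node("h1", "/dc1/rack1").
    static final class Node {
        final String host;
        final String rack;

        Node(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    // Returns the best locality the tracker can get for a block,
    // given the nodes that hold the block's replicas.
    static Locality classify(Node tracker, List<Node> replicas) {
        Locality best = Locality.OFF_RACK;
        for (Node replica : replicas) {
            if (replica.host.equals(tracker.host)) {
                return Locality.NODE_LOCAL; // cannot do better than this
            }
            if (replica.rack.equals(tracker.rack)) {
                best = Locality.RACK_LOCAL; // keep looking for a node-local copy
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up layout: two replicas on rack1, one on rack2.
        List<Node> replicas = Arrays.asList(
                new Node("h1", "/dc1/rack1"),
                new Node("h2", "/dc1/rack1"),
                new Node("h7", "/dc1/rack2"));

        // A tracker on rack1 that holds no replica itself.
        Node tracker = new Node("h4", "/dc1/rack1");
        System.out.println(classify(tracker, replicas));
    }
}
```

a scheduler that only distinguishes local from other would lump the RACK_LOCAL and OFF_RACK cases together, which is the behavior i think i'm seeing.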

thanks,
jeff

On 9/17/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Jeff Hammerbacher wrote:
> > has anyone leveraged the ability of datanodes to specify which datacenter
> > and rack they live in? if so, any evidence of performance improvements? it
> > seems that rack-awareness is only leveraged in block replication, not in
> > task execution.
>
> It often doesn't make a big improvement for map input, since in the
> common configuration, map tasks can nearly always be scheduled on nodes
> where the data is local. However, if you have a large HDFS cluster and
> overlay smaller mapreduce clusters over subsets of the hosts, then
> rack-locality can help map input performance too.
>
> Doug
