Jeff Hammerbacher wrote:
Has anyone leveraged the ability of datanodes to specify which datacenter and rack they live in? If so, any evidence of performance improvements? It seems that rack-awareness is only leveraged in block replication, not in task execution.
Rack-awareness often doesn't make a big difference for map input, since in the common configuration, where MapReduce runs on the same nodes as HDFS, map tasks can nearly always be scheduled on nodes where the data is local. However, if you have a large HDFS cluster and overlay smaller MapReduce clusters over subsets of the hosts, then rack-locality can help map input performance too.
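
To make the distinction concrete, here is a rough sketch of locality-preferring placement. It is not Hadoop's scheduler code: the host names, rack paths, and the functions `locality` and `pick_host` are all made up for illustration, and the only assumption is that each host maps to a rack path.

```python
# Hypothetical topology: host -> rack path (illustrative names only).
TOPOLOGY = {
    "node1": "/dc1/rack1",
    "node2": "/dc1/rack1",
    "node3": "/dc1/rack2",
    "node4": "/dc1/rack2",
}

# Lower rank means closer to the data.
RANK = {"node-local": 0, "rack-local": 1, "off-rack": 2}


def locality(task_host, replica_hosts, topology=TOPOLOGY):
    """Classify how close task_host is to the nearest replica of a block."""
    if task_host in replica_hosts:
        return "node-local"
    task_rack = topology.get(task_host)
    if task_rack is not None and any(
        topology.get(h) == task_rack for h in replica_hosts
    ):
        return "rack-local"
    return "off-rack"


def pick_host(free_hosts, replica_hosts, topology=TOPOLOGY):
    """Pick the free host closest to a replica (first one wins on ties)."""
    return min(free_hosts, key=lambda h: RANK[locality(h, replica_hosts, topology)])


if __name__ == "__main__":
    replicas = ["node1", "node3"]
    # Common configuration: tasks can run on every HDFS node.
    print(pick_host(["node1", "node2", "node3", "node4"], replicas))
    # Overlay configuration: tasks may only run on node2 and node4,
    # so the best available placement is on the same rack as a replica.
    print(locality(pick_host(["node2", "node4"], replicas), replicas))
```

In the first call a node holding the block is free, so rack information never comes into play. In the second, no task host holds a replica, and the rack mapping is what keeps the read on the same rack.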
Doug
