Jeff Hammerbacher wrote:
thanks for the quick reply. this is an interesting scenario that you bring up: a large hdfs pool across data centers with, perhaps, a jobtracker per data center (or a jobtracker per rack). i'm still not clear how rack-locality helps map input performance here;
You could do it that way. However, I was imagining striping sub-clusters across racks, so that, say, each of 4 sub-clusters contains 25% of the hosts in each rack.
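As a rough illustration of that striping (a sketch only, not Hadoop code; the rack names, host names, and the `stripe_hosts` helper are all hypothetical), hosts in each rack can be dealt round-robin into sub-clusters so that every sub-cluster has a presence in every rack:

```python
from collections import defaultdict


def stripe_hosts(racks, num_subclusters):
    """Deal each rack's hosts round-robin into sub-clusters.

    racks: dict of rack name -> list of host names (hypothetical layout).
    Returns a dict of sub-cluster index -> list of (rack, host) pairs.
    """
    subclusters = defaultdict(list)
    for rack, hosts in sorted(racks.items()):
        for i, host in enumerate(hosts):
            # Host i of every rack lands in sub-cluster i mod N, so each
            # sub-cluster ends up with roughly 1/N of every rack.
            subclusters[i % num_subclusters].append((rack, host))
    return dict(subclusters)


# Hypothetical topology: 3 racks of 8 hosts each, split into 4 sub-clusters.
racks = {
    "rack%d" % r: ["r%d-h%d" % (r, h) for h in range(8)]
    for r in range(3)
}
subclusters = stripe_hosts(racks, 4)
```

With 8 hosts per rack and 4 sub-clusters, each sub-cluster holds 2 hosts (25%) from each rack.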
JobInProgress.findNewTask seems to see the world as local or other, with no rack-awareness. are you referring to the ReplicationTargetChooser, which will always put one of the three replicas (assuming your replication level is 3) on your rack, hence increasing your chances of finding the block within your sub-jobtracker net?
That works too, but I was thinking that, even if a host in your cluster does not contain a block, a map task could be placed on a host in a rack that contains the block, so the map input would be rack-local.
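A minimal sketch of that preference order (node-local, then rack-local, then anything else). This is not the JobInProgress implementation; `locality`, `pick_host`, and the topology below are hypothetical names used only for illustration:

```python
def locality(host, block_hosts, rack_of):
    """Return 0 for node-local, 1 for rack-local, 2 for off-rack."""
    if host in block_hosts:
        return 0
    block_racks = {rack_of[h] for h in block_hosts}
    if rack_of[host] in block_racks:
        return 1
    return 2


def pick_host(free_hosts, block_hosts, rack_of):
    """Pick the free host in this sub-cluster that is closest to the block."""
    return min(free_hosts, key=lambda h: locality(h, block_hosts, rack_of))


# Hypothetical topology: the replicas live on a1 and b1, neither of which
# belongs to this sub-cluster, whose free hosts are c1 and a2.
rack_of = {"a1": "rackA", "a2": "rackA", "b1": "rackB", "c1": "rackC"}
block_hosts = {"a1", "b1"}
chosen = pick_host(["c1", "a2"], block_hosts, rack_of)
```

Here no host in the sub-cluster stores the block, but a2 shares a rack with the replica on a1, so the task goes to a2 and its map input is read within the rack.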
Doug
