On Sep 18, 2007, at 9:28 AM, Ted Dunning wrote:
> The key here is that the task farm need not coincide exactly with the
> storage farm.
On a large run with an identical hdfs/mapreduce cluster, we see very
high (95%) mapper locality. However, it is usually the case that the
hdfs cluster is larger than the map/reduce cluster, so it would be good
to make the map placement rack-aware. That is a recognized goal.
There are a few issues with the goal:
1. The network topology is currently hdfs centric and needs to be
generalized. There is a jira for this.
2. The filesystem interface needs to provide rack and node
placement information.
3. The input split interface needs to be generalized to deal with
racks as well as nodes.
4. The job tracker needs to use the rack information when it
assigns map tasks to task trackers.
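
To make the four points concrete, here is a rough sketch of the kind of
selection the job tracker could do once rack information is available.
It is not Hadoop code: Split, pick_split, and the rack_of host-to-rack
map are all hypothetical names made up for illustration, and the real
interfaces may end up looking quite different. The idea is just to
prefer a node-local split, then a rack-local one, then anything left.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Split:
    """Hypothetical input split: a name plus the nodes holding its data."""
    name: str
    hosts: List[str]


def pick_split(tracker_host: str,
               pending: List[Split],
               rack_of: Dict[str, str]) -> Optional[Split]:
    """Pick a pending split for the task tracker running on tracker_host.

    Preference order: node-local, then rack-local, then the first
    remaining split. rack_of is an assumed host -> rack mapping that
    covers both storage nodes and task tracker nodes.
    """
    tracker_rack = rack_of.get(tracker_host)
    rack_local = None
    for split in pending:
        # Best case: the data lives on the tracker's own node.
        if tracker_host in split.hosts:
            return split
        # Remember the first split with a replica in the tracker's rack.
        if (rack_local is None and tracker_rack is not None
                and any(rack_of.get(h) == tracker_rack for h in split.hosts)):
            rack_local = split
    if rack_local is not None:
        return rack_local
    # No locality available: fall back to any pending split.
    return pending[0] if pending else None
```

With a tracker on a node that stores no data (the case where the task
farm and storage farm differ), the node-local test never fires and the
rack-local fallback is what keeps traffic inside the rack.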
It is not on my short term radar, but it is on the medium term radar.
However, patches are welcome! *smile*
-- Owen