hey doug,

thanks for the quick reply.  this is an interesting scenario that you bring
up: a large hdfs pool across data centers with, perhaps, a jobtracker per
data center (or a jobtracker per rack).  i'm still not clear how
rack-locality helps map input performance here;
JobInProgress.findNewTask seems to see the world as local or other,
with no rack-awareness.  are you
referring to the ReplicationTargetChooser, which will always put one of the
three replicas (assuming your replication level is 3) on your rack, hence
increasing your chances of finding the block within your sub-jobtracker net?
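
to make the distinction concrete, here is a rough sketch of what a three-way
locality check could look like, as opposed to the local-or-other view
described above.  this is not hadoop's code: the names (Locality, rack_of,
classify, pick_split) are hypothetical, and it assumes each node is
identified by a path-style location such as "/dc1/rack1/host1".

```python
from enum import IntEnum


class Locality(IntEnum):
    """Lower values are cheaper places to read a block from."""
    NODE_LOCAL = 0   # a replica lives on the node running the task
    RACK_LOCAL = 1   # a replica lives on another node in the same rack
    OFF_RACK = 2     # every replica is on a different rack


def rack_of(location):
    # Assumed location format: "/datacenter/rack/host".
    # Dropping the last path component leaves "/datacenter/rack".
    return location.rsplit("/", 1)[0]


def classify(task_node, replica_nodes):
    """Return the best locality level any replica offers to task_node."""
    best = Locality.OFF_RACK
    for replica in replica_nodes:
        if replica == task_node:
            return Locality.NODE_LOCAL
        if rack_of(replica) == rack_of(task_node):
            best = Locality.RACK_LOCAL
    return best


def pick_split(task_node, splits):
    """Pick the split id whose replicas are closest to task_node.

    splits maps a split id to the list of locations holding its replicas.
    """
    return min(splits, key=lambda s: classify(task_node, splits[s]))
```

a scheduler that only knows local or other would treat the RACK_LOCAL and
OFF_RACK cases the same; the middle tier is where rack-awareness would
matter for a jobtracker that covers only a subset of the hdfs hosts.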

thanks,
jeff

On 9/17/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Jeff Hammerbacher wrote:
> > has anyone leveraged the ability of datanodes to specify which datacenter
> > and rack they live in?  if so, any evidence of performance improvements?
> > it seems that rack-awareness is only leveraged in block replication, not
> > in task execution.
>
> It often doesn't make a big improvement for map input, since in the
> common configuration, map tasks can nearly always be scheduled on nodes
> where the data is local.  However, if you have a large HDFS cluster and
> overlay smaller mapreduce clusters over subsets of the hosts, then
> rack-locality can help map input performance too.
>
> Doug
>
