HDFS uses the network topology to decide how to distribute and replicate data. An admin has to configure a script that describes the network topology to HDFS: the script is invoked with a list of node addresses (IP addresses or host names) and prints the rack name for each one. The script is specified via the parameter "topology.script.file.name" in the configuration file. This has been tested when nodes are on different subnets in the same data center.

This code might not be generic enough (and is not yet tested) to support multiple data centers. One can extend the topology mapping by providing one's own implementation and specifying that class (packaged in a jar on the classpath) with the config parameter topology.node.switch.mapping.impl. You will find more details at http://hadoop.apache.org/core/docs/current/cluster_setup.html#Hadoop+Rack+Awareness
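For illustration only, a minimal sketch of such a mapping is below. It is untested; the class name, subnet checks, and rack/data-center paths are made up. It assumes the org.apache.hadoop.net.DNSToSwitchMapping interface (a single resolve method), which is the same interface the built-in script-based mapping implements.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical mapping that places nodes into data-center/rack paths
// based on their IP prefix. It would be configured with, e.g.:
//   topology.node.switch.mapping.impl = MultiDataCenterMapping
// with the jar containing this class on the Hadoop classpath.
public class MultiDataCenterMapping implements DNSToSwitchMapping {

  // Hadoop passes a list of host names or IP addresses and expects
  // one network path (e.g. /dc1/rack1) per entry, in the same order.
  public List<String> resolve(List<String> names) {
    List<String> paths = new ArrayList<String>(names.size());
    for (String name : names) {
      paths.add(pathFor(name));
    }
    return paths;
  }

  // Made-up subnet rules; replace with knowledge of your own network.
  private String pathFor(String name) {
    if (name.startsWith("10.1.")) {
      return "/dc1/rack1";
    }
    if (name.startsWith("10.2.")) {
      return "/dc2/rack1";
    }
    return "/default-rack";
  }
}

Whether HDFS placement and scheduling make good use of the extra data-center level in the path is exactly the untested part mentioned above.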
thanks,
dhruba

On Tue, Jun 17, 2008 at 10:18 PM, Ian Holsman (Lists) <[EMAIL PROTECTED]> wrote:
> hi.
>
> I want to run a distributed cluster, where I have say 20 machines/slaves in
> 3 separate data centers that belong to the same cluster.
>
> Ideally I would like the other machines in the data center to be able to
> upload files (apache log files in this case) onto the local slaves and then
> have map/reduce tasks do their magic without having to move data until the
> reduce phase, where the amount of data will be smaller.
>
> Does Hadoop have this functionality?
> How do people handle multi-datacenter logging with Hadoop in this case? Do
> you just copy the data into a central location?
>
> regards
> Ian
>
