If the client you use for the copy is one of the datanodes, then the first
replica goes to that datanode (the client) and the second to a random node
elsewhere in the cluster. This policy is designed to improve write performance.
On the other hand, if you would like the data to be distributed, then, as Ted
pointed out, run the copy from a node which is not a datanode in your cluster.
In that case the first replica is placed on a random node in the cluster,
because your client is no longer a datanode.
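
If you want to verify where the replicas of each block land, you can grow the
getFileCacheHints call you already have into a small standalone tool. Here is a
rough sketch against the same 0.16-era FileSystem API (the class name and
argument handling are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockPlacement {
      public static void main(String[] args) throws Exception {
        FileSystem dfs = FileSystem.get(new Configuration());
        Path inFile = new Path(args[0]);
        // One String[] per block; each entry is a host holding a replica.
        String[][] hints = dfs.getFileCacheHints(inFile, 0, dfs.getLength(inFile));
        for (int b = 0; b < hints.length; b++) {
          StringBuffer hosts = new StringBuffer();
          for (int r = 0; r < hints[b].length; r++) {
            hosts.append(hints[b][r]).append(' ');
          }
          System.out.println("block " + b + ": " + hosts);
        }
      }
    }

If the client is a datanode you should see its hostname in every block's host
list; from a non-datanode client the hosts should vary.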

Thanks,
Lohit

----- Original Message ----
From: Ted Dunning <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
Sent: Monday, March 24, 2008 7:40:06 AM
Subject: Re: [core] problems while copying files from local file system to dfs



Copy from a machine that is *not* running as a data node in order to get
better balancing.  Using distcp may also help because the nodes actually
doing the copying will be spread across the cluster.

You should probably be running a rebalancing script as well if your nodes
have differing sizes.
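
For reference, the invocations look something like this (assuming 0.16 or
later, where the balancer script was added; the source and destination URLs
are placeholders):

    bin/hadoop distcp <srcurl> <desturl>
    bin/start-balancer.sh -threshold 10

The balancer shuffles blocks from over-full datanodes to under-full ones until
each node's usage is within the given percentage of the cluster average.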


On 3/24/08 7:35 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> Hi
> 
> I want to copy 1000 files (37GB) of data to the dfs.  I have a setup
> of 9-10 nodes, each with between 5 and 15GB of free space.
> 
> While copying the files from the local file system on nodeA, the node
> gets full of data and the process stalls.
> 
> I have another free node with 80GB of free space. After adding the
> datanode to the cluster, I ran the same copy process again:
> 
> hadoop dfs -copyFromLocal ...
> 
> During the copy of these files to the DFS, I ran a Java
> application in order to check where the data is located (replication
> level is set to 2):
> 
> String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
> 
> The output I print is the following
> 
> File name = GASS.0011.63800-0011.63900.zip
> File cache hints =   gaiawl07.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.53100-0011.53200.zip
> File cache hints =   gaiawl03.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.23800-0011.23900.zip
> File cache hints =   gaiawl08.net4.lan gaiawl02.net4.lan
> ############################################
> File name = GASS.0011.18800-0011.18900.zip
> File cache hints =   gaiawl02.net4.lan gaiawl06.net4.lan
> ....
> 
> In this small sample gaiawl02.net4.lan appears for every file, and
> this is currently happening for every copied file.  I launch the
> copy process from that machine, which is also the one with 80GB of
> free space.  I did this because of the problem I pointed out
> previously, of filling up a node and stalling the copy operation.
> 
> Shouldn't the data be dispersed across all the nodes? If that data
> node crashes, only 1 replica of the data will exist in the
> cluster.
> 
> During the "staging" phase I understand that that particular node
> contains a local copy of the file being added to the HDFS. But once a
> block is filled, this doesn't mean the block also has to be on
> that node. Am I right?
> 
> Is it possible to spread the data among all the data nodes, so that
> no single node keeps 1 replica of every copied file?
> 
> thanks
