>Improves performance on the basis that files are copied locally on
>that node, so there is no need for network transmission. But isn't that
>policy weaker?  If that node crashes (the worst case), you lose one
>redundancy level.
This policy was chosen for better write performance. As you mentioned, yes, in
your case you have only 2 copies, and it increases the probability of losing
replicas. There has been discussion about having different placement policies
for different files, but that hasn't been implemented yet.
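
In the meantime, one thing you can already do is raise the replication factor on
a per-file basis, since that (unlike the placement policy) is configurable today.
A minimal sketch, assuming a hypothetical path and a target of 3 copies (the
hadoop dfs -setrep shell command should do the same thing):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RaiseReplication {
    public static void main(String[] args) throws IOException {
      FileSystem dfs = FileSystem.get(new Configuration());
      // Hypothetical path; with 3 copies, losing the node that holds the
      // client-local replica still leaves two replicas elsewhere.
      dfs.setReplication(new Path("/user/data/important.zip"), (short) 3);
    }
  }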



On 24/03/2008, lohit <[EMAIL PROTECTED]> wrote:
> If the client you use to copy is one of the datanodes, then the first copy would
> go to this datanode (the client) and the second would go to another random node in
> your cluster. This policy is designed to improve write performance. On the
> other hand, if you would like the data to be distributed, as Ted pointed out,
> use a node which is not a datanode in your cluster. In this case, the
> first copy would be placed on a random node in the cluster because your
> client is no longer a datanode.
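
[For anyone who wants to watch this policy in action: the sketch below (paths
and sizes are placeholders) writes a small file and prints the same cache hints
used further down in this thread. When the client is also a datanode, that host
should show up for every block.]

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WherePlaced {
    public static void main(String[] args) throws IOException {
      FileSystem dfs = FileSystem.get(new Configuration());
      Path file = new Path("/tmp/placement-test");   // placeholder path
      FSDataOutputStream out = dfs.create(file);
      out.write(new byte[64 * 1024]);                // small test file
      out.close();
      // Each inner array lists the hosts holding one block of the range.
      String[][] hints = dfs.getFileCacheHints(file, 0, 64 * 1024L);
      for (String[] blockHosts : hints) {
        StringBuilder line = new StringBuilder();
        for (String host : blockHosts) {
          line.append(host).append(' ');
        }
        System.out.println(line.toString().trim());
      }
    }
  }
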
>
>  Thanks,
>
> Lohit
>
>
>  ----- Original Message ----
>  From: Ted Dunning <[EMAIL PROTECTED]>
>  To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
>  Sent: Monday, March 24, 2008 7:40:06 AM
>  Subject: Re: [core] problems while coping files from local file system to dfs
>
>
>
>  Copy from a machine that is *not* running as a data node in order to get
>  better balancing.  Using distcp may also help because the nodes actually
>  doing the copying will be spread across the cluster.
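
[The same applies when the copy is driven from code: run something like the
sketch below, with placeholder paths, on a host that is not a datanode, and the
first replica of each block should land on a random node rather than always on
the local one.]

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CopyFromOutside {
    public static void main(String[] args) throws IOException {
      // Run this on a machine that is NOT a datanode; the namenode then has
      // no "local" node to favour for the first replica of each block.
      FileSystem dfs = FileSystem.get(new Configuration());
      dfs.copyFromLocalFile(new Path("/local/dir/data.zip"),  // placeholder source
                            new Path("/gass/data.zip"));      // placeholder destination
    }
  }
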
>
>  You should probably be running a rebalancing script as well if your nodes
>  have differing sizes.
>
>
>  On 3/24/08 7:35 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
>  wrote:
>
>  > Hi
>  >
>  > I want to copy 1000 files (37GB) of data to the dfs.  I have a setup
>  > of 9-10 nodes, each one with between 5 and 15GB of free space.
>  >
>  > While copying the files from the local file system on nodeA, the node
>  > gets full of data and the process stalls.
>  >
>  > I have another free node with 80GB of free space. After adding that
>  > datanode to the cluster, I ran the same copy process again:
>  >
>  > hadoop dfs -copyFromLocal ...
>  >
>  > During the copy of these files to the DFS, I ran a Java application
>  > to check where the data is located (the replication level is set
>  > to 2):
>  >
>  > String [][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
>  >
>  > The output I print is the following
>  >
>  > File name = GASS.0011.63800-0011.63900.zip
>  > File cache hints =   gaiawl07.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.53100-0011.53200.zip
>  > File cache hints =   gaiawl03.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.23800-0011.23900.zip
>  > File cache hints =   gaiawl08.net4.lan gaiawl02.net4.lan
>  > ############################################
>  > File name = GASS.0011.18800-0011.18900.zip
>  > File cache hints =   gaiawl02.net4.lan gaiawl06.net4.lan
>  > ....
>  >
>  > In this small sample gaiawl02.net4.lan appears for every file, and
>  > this is currently happening for every copied file.  I launch the
>  > copy process from that machine, which is also the one with 80GB of
>  > free space.  I did this because of the problem I mentioned earlier of
>  > filling up a node and stalling the copy operation.
>  >
>  > Shouldn't the data be dispersed across all the nodes?  If that data
>  > node crashes, only 1 replica of the data is going to exist in the
>  > cluster.
>  >
>  > During the "staging" phase I understand that that particular node
>  > contains a local copy of the file being added to HDFS. But once a
>  > block is filled, this doesn't mean that the block also has to be on
>  > that node. Am I right?
>  >
>  > Is it possible to spread the data among all the data nodes, so that
>  > no single node keeps 1 replica of every copied file?
>  >
>  > thanks


