> Improves performance on the basis that files are copied locally on
> that node, so there is no need for network transmission. But isn't
> that policy weaker? If that node crashes (the worst case), you lose
> one redundancy level.

This policy was chosen for better write performance. As you mentioned, yes,
in your case you have only 2 copies, and it increases the probability of
losing replicas. There has been discussion about having a different policy
for different files, but it hasn't been implemented yet.
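Until then, the practical mitigations are on the client side. A rough
sketch of both, following Ted's advice quoted below (the paths, hostnames,
port, and the replication level of 3 are made up for illustration; check
the flags against your release's dfs shell):

  # Run the copy from a machine that is NOT a datanode, so the first
  # replica of each block lands on a random node instead of piling up
  # on the writer:
  hadoop dfs -copyFromLocal /local/gass /user/alfonso/gass

  # distcp spreads the copy work itself across the cluster, but then
  # the source must be readable at the same path from every node
  # (e.g. an NFS mount):
  hadoop distcp file:///shared/gass hdfs://namenode:9000/user/alfonso/gass

  # Per-file *placement* policies don't exist yet, but per-file
  # *replication* does; raising it to 3 keeps two live replicas even
  # if the writing node dies (-R recurses over the directory):
  hadoop dfs -setrep -R 3 /user/alfonso/gass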
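To see where the blocks actually ended up, the getFileCacheHints one-liner
quoted below can be fleshed out into a small standalone check. A minimal
sketch against the 0.16-era FileSystem API (the class name is made up, and
the fixed (0, 100L) range only reports blocks overlapping the first 100
bytes; pass the file length instead to cover every block of a large file):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockPlacementCheck {
    public static void main(String[] args) throws Exception {
      // Picks up hadoop-site.xml from the classpath; fs.default.name
      // must point at the namenode of the cluster being checked.
      Configuration conf = new Configuration();
      FileSystem dfs = FileSystem.get(conf);
      for (String name : args) {
        Path inFile = new Path(name);
        String[][] hints = dfs.getFileCacheHints(inFile, 0, 100L);
        System.out.println("File name = " + inFile.getName());
        for (String[] blockHosts : hints) {
          // One line per block: every hostname holding a replica.
          StringBuffer line = new StringBuffer("File cache hints =");
          for (String host : blockHosts) {
            line.append(' ').append(host);
          }
          System.out.println(line);
        }
        System.out.println("############################################");
      }
    }
  }

If gaiawl02 stops showing up on every line after you rerun the copy from a
non-datanode client, the placement policy described above is behaving as
expected.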
On 24/03/2008, lohit <[EMAIL PROTECTED]> wrote:
> If the client you use to copy is one of the datanodes, then the first
> copy would go to this datanode (the client) and the second would go to
> another random node in your cluster. This policy is designed to improve
> write performance. On the other hand, if you would like the data to be
> distributed, as Ted pointed out, use a node which is not a datanode in
> your cluster. In this case, the first copy would be placed on a random
> node in the cluster because your client is no longer a datanode.
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
> Sent: Monday, March 24, 2008 7:40:06 AM
> Subject: Re: [core] problems while copying files from local file system to dfs
>
> Copy from a machine that is *not* running as a data node in order to get
> better balancing. Using distcp may also help because the nodes actually
> doing the copying will be spread across the cluster.
>
> You should probably be running a rebalancing script as well if your nodes
> have differing sizes.
>
> On 3/24/08 7:35 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote:
>
> > Hi
> >
> > I want to copy 1000 files (37 GB) of data to the dfs. I have a set-up
> > of 9-10 nodes, each one with between 5 and 15 GB of free space.
> >
> > While copying the files from the local file system on nodeA, the node
> > gets full of data and the process gets stalled.
> >
> > I have another free node with 80 GB of free space. After adding that
> > datanode to the cluster, I ran the same copy process again:
> >
> > hadoop dfs -copyFromLocal ...
> >
> > During the copy of these files to the DFS, I ran a Java application
> > in order to check where the data is located (the replication level is
> > set to 2):
> >
> > String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);
> >
> > The output I print is the following:
> >
> > File name = GASS.0011.63800-0011.63900.zip
> > File cache hints = gaiawl07.net4.lan gaiawl02.net4.lan
> > ############################################
> > File name = GASS.0011.53100-0011.53200.zip
> > File cache hints = gaiawl03.net4.lan gaiawl02.net4.lan
> > ############################################
> > File name = GASS.0011.23800-0011.23900.zip
> > File cache hints = gaiawl08.net4.lan gaiawl02.net4.lan
> > ############################################
> > File name = GASS.0011.18800-0011.18900.zip
> > File cache hints = gaiawl02.net4.lan gaiawl06.net4.lan
> > ....
> >
> > In this small sample gaiawl02.net4.lan appears for every file, and
> > this is currently happening for every copied file. I launch the copy
> > process from that machine, which is also the one with 80 GB of free
> > space. I did this because of the problem I pointed out previously of
> > filling up a node and stalling the copy operation.
> >
> > Shouldn't the data be dispersed across all the nodes? Because if that
> > data node crashes, only 1 replica of the data is going to exist in
> > the cluster.
> >
> > During the "staging" phase I understand that that particular node
> > contains a local copy of the file being added to the HDFS. But once a
> > block is filled, this doesn't mean that the block has to be also on
> > that node. Am I right?
> >
> > Is it possible to spread the data among all the data nodes, to avoid
> > one node keeping 1 replica of every copied file?
> >
> > thanks