Hi, I want to copy 1000 files (37 GB) of data to the DFS. I have a setup of 9-10 nodes, each with between 5 and 15 GB of free space.
While copying the files from the local file system on nodeA, that node fills up and the process stalls. I have another free node with 80 GB of free space. After adding it as a datanode to the cluster, I ran the same copy process again: hadoop dfs -copyFromLocal ...

During the copy of these files to HDFS, I ran a Java application to check where the data is located (the replication level is set to 2):

String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);

The output I print is the following:

File name = GASS.0011.63800-0011.63900.zip
File cache hints = gaiawl07.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.53100-0011.53200.zip
File cache hints = gaiawl03.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.23800-0011.23900.zip
File cache hints = gaiawl08.net4.lan gaiawl02.net4.lan
############################################
File name = GASS.0011.18800-0011.18900.zip
File cache hints = gaiawl02.net4.lan gaiawl06.net4.lan
....

In this small sample, gaiawl02.net4.lan appears for every file, and this is happening for every copied file so far. I launch the copy process from that machine, which is also the one with 80 GB of free space; I did this because of the problem I mentioned above of a node filling up and stalling the copy operation.

Shouldn't the data be dispersed across all the nodes? If that datanode crashes, only one replica of the data will remain in the cluster. During the "staging" phase I understand that this particular node holds a local copy of the file being added to HDFS, but once a block is filled, that doesn't mean the block also has to stay on that node. Am I right?

Is it possible to spread the data among all the datanodes, so that no single node keeps one replica of every copied file?

Thanks
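For reference, here is a minimal, self-contained sketch of the checking application. The class name, the command-line argument for the target directory, and the listStatus loop are just illustrative wrapping that I have put around the getFileCacheHints call shown above; it uses the old FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocationCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the file system named in the Hadoop configuration.
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf);

        // Walk every file in the target directory and print the hosts
        // that hold the block covering its first 100 bytes.
        for (FileStatus status : dfs.listStatus(new Path(args[0]))) {
            Path inFile = status.getPath();
            String[][] hostnames = dfs.getFileCacheHints(inFile, 0, 100L);

            System.out.println("File name = " + inFile.getName());
            StringBuilder hints = new StringBuilder("File cache hints = ");
            for (String[] blockHosts : hostnames) {
                for (String host : blockHosts) {
                    hints.append(host).append(" ");
                }
            }
            System.out.println(hints.toString());
            System.out.println("############################################");
        }
    }
}

(I believe the same placement can also be inspected from the command line with hadoop fsck <path> -files -blocks -locations, but the listing above comes from the Java check.)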