I run a 512-node Hadoop cluster. Yesterday I moved 30 GB of compressed data from an NFS-mounted partition by running the following on the namenode:
hadoop fs -copyFromLocal /mnt/data/data1 /mnt/data/data2 /mnt/data/data3 hdfs:/data

When the job completed, the local disk on the namenode was 40% full (most of it used by the dfs directories), while the other nodes were at about 1% disk utilization. Just to see if there was an issue, I deleted the hdfs:/data directory and restarted the copy from a datanode. Once again, the disk on that datanode was substantially over-utilized.

I would have assumed that disk space would be consumed more or less uniformly across all the datanodes. Is there a reason why one disk would be over-utilized? Do I have to run the balancer every time I copy data in? Am I missing something?

Raj
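P.S. For what it's worth, my "more or less uniform" expectation came from this rough arithmetic (just a back-of-the-envelope sketch; the replication factor of 3 is an assumption, since I haven't overridden dfs.replication on this cluster):

```python
# Rough expected per-node usage if blocks were placed uniformly.
# Assumptions: default replication factor of 3, 512 datanodes, 30 GB input.
DATA_GB = 30
REPLICATION = 3   # assumed default dfs.replication
NODES = 512

total_stored_gb = DATA_GB * REPLICATION   # 90 GB stored cluster-wide
per_node_gb = total_stored_gb / NODES     # under 0.2 GB per node

print(f"expected per-node share: {per_node_gb:.2f} GB")
```

So one node absorbing a large share of the 30 GB is far from what I expected.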
