I'm a newbie to Hadoop and HDFS, and I'm seeing odd behavior in HDFS that I
hope somebody can clear up for me.  I'm running Hadoop 0.20.1+169.127 from the
Cloudera distro on 4 identical nodes, each with 4 CPUs and 100 GB of disk
space.  Replication is set to 2.
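
In case it matters, the per-node usage I mention below is the kind of
per-datanode figure that hadoop dfsadmin -report shows; I assume that's the
right place to look:

hadoop dfsadmin -report    # shows each datanode's capacity, DFS Used, and DFS Used%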

I run:

hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar teragen 50000000 tera_in5

This produces the expected 10 GB of data on disk (5 GB of output * 2
replicas).  But the data is spread very unevenly across the nodes, ranging
from 1.7 GB to 3.2 GB per node.  Then I sort the data:

hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar terasort tera_in5 tera_out5

It finishes successfully, and HDFS reports the right amount of data:

$ hadoop fs -du /user/hadoop/
Found 2 items
5000023410  hdfs://namd-1/user/hadoop/tera_in5
5000170993  hdfs://namd-1/user/hadoop/tera_out5
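
By the way, I've been assuming hadoop fsck is the right way to see how many
replicas each block actually has and which datanodes hold them; is there a
better tool for this?

hadoop fsck /user/hadoop/tera_in5 -files -blocks -locations    # where each input block's replicas ended up
hadoop fsck /user/hadoop/tera_out5 -files -blocks -locations   # same for the sort output; the summary also shows average block replication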

However, all the new data is on one node (apparently chosen at random), and
total disk usage is only 15 GB, which means the output data is not
replicated.  For nearly all of the sort's elapsed time, the other 3 nodes sit
idle.  Some of the output data is in dfs/data/current, but a lot of it is in
one of 64 new subdirectories (dfs/data/current/subdir0 through subdir63).
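
For what it's worth, those dfs/data paths are on the datanodes' local disks;
the loop below is the kind of check I mean (the hostnames and the data
directory path are just placeholders for whatever dfs.data.dir points to):

for h in node1 node2 node3 node4; do                # placeholder hostnames
  ssh $h "du -sh /var/lib/hadoop/dfs/data/current"  # placeholder path for dfs.data.dir
done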

Why is all this happening?  Am I missing some tunables that would make HDFS
balance and replicate the data correctly?
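
I realize I can probably clean things up by hand after the fact; I assume the
commands below would rebalance the blocks and bring the output back up to 2
replicas (the threshold value is just a guess), but I'd like to understand
why that isn't happening on its own:

hadoop balancer -threshold 10                     # move blocks until no node is more than ~10% from the cluster average
hadoop fs -setrep -R -w 2 /user/hadoop/tera_out5  # recursively set the output files to 2 replicas and wait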

Thanks,

Jeff
