hi raj,

what is a "local node"? is it relative to the tasks that are started?


stijn

On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
Stijn,

The first replica of each block is always stored on the local node. Assuming you
had a replication factor of 3, the node that generates the data will get about
10GB of it, and the other 20GB will be distributed among the other nodes.
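
To put rough numbers on it (a simplified sketch, assuming one node writes all
the data and ignoring rack awareness):

    raw teragen output:   100,000,000 rows x 100 bytes = 10GB
    with replication 3:   3 x 10GB = 30GB on disk in total
    first replica local:  ~10GB on the writing node
    remaining 20GB spread over the other 4 datanodes: ~5GB each

That lines up with the 9.4GB versus 4.2-4.8GB you reported.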

Raj

________________________________
From: Stijn De Weirdt <stijn.dewei...@ugent.be>
To: common-user@hadoop.apache.org
Sent: Monday, April 2, 2012 9:54 AM
Subject: data distribution in HDFS

hi all,

i've just started to play around with hdfs+mapred. i'm currently running 
teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that are all tasktracker and datanode, one of 
which is also jobtracker and namenode on top of that (this one node runs both 
the namenode process and the datanode process).

when i do a teragen run, the data is not distributed equally over all nodes: the 
node that is also namenode gets a bigger portion of all the data (as seen by df 
on the nodes and by using dfsadmin -report).
i also got this distribution when i ran the TestDFSIO write test (50 files of 
1GB).


i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i 
expect 100M * 0.1kB = 10GB of data. (if i add up the volumes in use by hdfs, it's 
actually quite a bit more.)
the 4 pure datanodes are using 4.2-4.8GB each, and the data+namenode has 9.4GB in 
use. so it looks as if this one datanode counts as 2 nodes.
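(i suppose i could also check where the individual blocks end up with something 
like

    hadoop fsck /benchmarks/teragen -files -blocks -locations

which should list the datanode locations per block.)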

when i do an ls on the filesystem, i see that teragen created 250MB files; the 
current hdfs blocksize is 64MB.
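(doing the math: 10GB / 250MB = 40 output files, presumably one per map task, 
and each 250MB file spans 4 hdfs blocks: 3 full 64MB blocks plus a ~58MB tail.)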

is there a reason why one datanode is preferred over the others?
it is annoying, since the terasort output behaves the same way and i can't use 
the full hdfs space for testing. also, since more IO goes to this one node, the 
performance isn't really balanced.

many thanks,

stijn
