Stijn,

The first replica of each block is always stored on the local node, i.e. the node 
where the writing client runs. Assuming a replication factor of 3, the node that 
generates the data gets a full ~10GB copy, and the remaining ~20GB of replicas is 
distributed among the other nodes. That matches your numbers: 20GB spread over the 
four remaining datanodes is about 5GB each, close to the 4.2-4.8GB you observed.
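
If you want to check where the replicas actually landed, fsck can list the block 
locations per file, and the balancer can redistribute blocks after the fact (both 
are standard Hadoop tools; the path below is just your teragen output directory):

  hadoop fsck /benchmarks/teragen -files -blocks -locations
  hadoop balancer -threshold 5

Note that the balancer only moves existing blocks; new writes will still favor the 
local node, so the skew will come back on the next run.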

Raj





>________________________________
> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>To: common-user@hadoop.apache.org 
>Sent: Monday, April 2, 2012 9:54 AM
>Subject: data distribution in HDFS
> 
>hi all,
>
>i just started to play around with hdfs+mapred. i'm currently running 
>teragen/sort/validate to see if i understand it all.
>
>the test setup involves 5 nodes that are all tasktracker and datanode, with one 
>node also acting as jobtracker and namenode on top of that (this one node is 
>running both the namenode hadoop process and the datanode process).
>
>when i do the teragen run, the data is not distributed equally over all 
>nodes. the node that is also namenode gets a bigger portion of all the data 
>(as seen by df on the nodes and by using dfsadmin -report).
>i also get this distribution when i run the TestDFSIO write test (50 files of 
>1GB).
>
>
>i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i 
>expect 100M * 100 bytes = 10GB of data. (if i add up the volumes in use by hdfs, 
>it's actually quite a bit more.)
>4 datanodes are using 4.2-4.8GB each, and the data+namenode has 9.4GB in use, so 
>this one datanode is effectively counted as 2 nodes.
>
>when i do ls on the filesystem, i see that teragen created 250MB files; the 
>current hdfs blocksize is 64MB.
>
>is there a reason why one datanode is preferred over the others?
>it is annoying since the terasort output behaves the same way, and i can't use the 
>full hdfs space for testing. also, since more IO goes to this one node, the 
>performance isn't really balanced.
>
>many thanks,
>
>stijn
>
>
>
