Stijn,

The first replica of each block is always written to the local node (when the client writing the data is itself a datanode). Assuming a replication factor of 3, a 10GB teragen produces 30GB of raw data: the node that generates the data holds one replica of every block, about 10GB, and the other 20GB is distributed among the remaining nodes. That matches the df numbers you're seeing (9.4GB on that node, roughly 4.5GB on each of the other four).
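If you want to confirm this, fsck will show where each block's replicas ended up (using the output path from your teragen command):

  hadoop fsck /benchmarks/teragen -files -blocks -locations

You should see the generating node listed in the locations of nearly every block. If the skew is a problem for your terasort runs, you can spread the blocks out after the fact with the balancer (the threshold is a percentage of disk usage; 10 is the default):

  hadoop balancer -threshold 10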
Raj

>________________________________
>From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>To: common-user@hadoop.apache.org
>Sent: Monday, April 2, 2012 9:54 AM
>Subject: data distribution in HDFS
>
>hi all,
>
>i've just started to play around with hdfs+mapred, and i'm currently running
>teragen/sort/validate to see if i understand it all.
>
>the test setup involves 5 nodes that are all tasktracker and datanode, one of
>which is also jobtracker and namenode on top of that (i.e. this node runs the
>namenode process as well as a datanode process).
>
>when i do the teragen run, the data is not distributed equally over all
>nodes: the node that is also namenode gets a bigger portion of all the data
>(as seen with df on the nodes and with dfsadmin -report). i get the same
>distribution when i run the TestDFSIO write test (50 files of 1GB).
>
>i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen,
>so i expect 100M * 0.1kB = 10GB of data. (if i add up the volumes in use by
>hdfs, it's actually quite a bit more.)
>4 datanodes are using 4.2-4.8GB, and the datanode+namenode has 9.4GB in use,
>so this one datanode counts as 2 nodes.
>
>when i do an ls on the filesystem, i see that teragen created 250MB files;
>the current hdfs blocksize is 64MB.
>
>is there a reason why one datanode is preferred over the others?
>it is annoying, since the terasort output behaves the same and i can't use
>the full hdfs space for testing that way. also, since more IO goes to this
>one node, the performance isn't really balanced.
>
>many thanks,
>
>stijn