data distribution in HDFS

Stijn De Weirdt Mon, 02 Apr 2012 09:50:17 -0700

hi all,

i've just started to play around with hdfs+mapred. i'm currently playing with teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that are all tasktracker and datanode, and one of those nodes is also jobtracker and namenode on top of that. (this one node is running both the namenode hadoop process and the datanode process.)

when i do the teragen run, the data is not distributed equally over all nodes. the node that is also namenode gets a bigger portion of all the data (as seen by df on the nodes and by using dfsadmin -report). i also get this distribution when i run the TestDFSIO write test (50 files of 1GB).

i use the basic command line teragen $((100*1000*1000)) /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's actually quite a bit more.) 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so this one datanode is treated as if it were 2 nodes.
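to make the arithmetic above explicit, here is a rough sketch in python. the row count, row size and per-node usage are the numbers from this mail; the replication factor is NOT stated here, so REPLICATION = 3 is only an assumption, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the teragen numbers quoted above.
ROWS = 100 * 1000 * 1000   # row count passed to teragen
ROW_BYTES = 100            # 0.1kb per row, as stated above
REPLICATION = 3            # ASSUMPTION: dfs.replication is not given in this mail


def logical_bytes(rows=ROWS, row_bytes=ROW_BYTES):
    """Size of the generated data before replication."""
    return rows * row_bytes


def raw_bytes(replication=REPLICATION, rows=ROWS, row_bytes=ROW_BYTES):
    """Disk space used across the cluster if every block is stored `replication` times."""
    return logical_bytes(rows, row_bytes) * replication


def observed_range_gb():
    """Total usage reported above: 4 nodes at 4.2-4.8GB plus one node at 9.4GB."""
    low = 4 * 4.2 + 9.4
    high = 4 * 4.8 + 9.4
    return low, high


if __name__ == "__main__":
    print("logical bytes:", logical_bytes())
    print("raw bytes at assumed replication:", raw_bytes())
    print("observed total range (GB):", observed_range_gb())
```

if the assumed replication factor holds, replication alone could account for the total being well above 10GB, but it says nothing about why one node holds about twice as much as the others.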

when i do ls on the filesystem, i see that teragen created 250MB files; the current hdfs blocksize is 64MB.
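for reference, a small sketch of how a file of that size splits into blocks. it assumes the sizes are exactly 250MB and 64MB as quoted above (the real byte counts may differ slightly), and the helper names are hypothetical.

```python
import math


def blocks_per_file(file_mb, block_mb):
    """Number of hdfs blocks needed to hold one file (the last block may be partial)."""
    return math.ceil(file_mb / block_mb)


def last_block_mb(file_mb, block_mb):
    """Size of the final block; equals block_mb when the file is an exact multiple."""
    remainder = file_mb % block_mb
    return remainder if remainder else block_mb


if __name__ == "__main__":
    # Sizes taken from the ls output and blocksize mentioned above.
    print("blocks per file:", blocks_per_file(250, 64))
    print("last block size (MB):", last_block_mb(250, 64))
```

so each teragen output file spans several blocks, and it is each block (not each file) that gets placed on datanodes.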


is there a reason why one datanode is preferred over the others?

it is annoying since the terasort output behaves the same, and i can't use the full hdfs space for testing that way. also, since more IO comes to this one node, the performance isn't really balanced.


many thanks,

stijn
