I'm testing scenarios where a very small cluster needs to re-replicate a lot 
of data after a node goes down. I'm observing pretty slow performance: each 
node seems to send only one or two blocks at a time, with long stretches 
where the network interface sits idle for 1-2 seconds before the next block 
copy starts.



I've looked around for how to tune this, but setting the following properties 
doesn't seem to increase thread count or throughput:

  dfs.replication.interval      (tried 3, 1)

  dfs.namenode.handler.count    (tried 10, 30)

  dfs.datanode.handler.count    (tried 25)

  dfs.replication.considerLoad  (tried true and false)
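
For reference, here is roughly how those properties look in hdfs-site.xml 
(values shown are one combination I tried; property names as above):

```xml
<configuration>
  <!-- Seconds between NameNode replication-work scans (tried 3 and 1) -->
  <property>
    <name>dfs.replication.interval</name>
    <value>1</value>
  </property>
  <!-- NameNode RPC handler threads (tried 10 and 30) -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>30</value>
  </property>
  <!-- DataNode handler threads (tried 25) -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>25</value>
  </property>
  <!-- Whether the NameNode avoids busy nodes when choosing replication
       targets (tried both true and false) -->
  <property>
    <name>dfs.replication.considerLoad</name>
    <value>false</value>
  </property>
</configuration>
```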

Nothing changes the behavior. It looks like each node with data to send opens 
connections to every other node capable of receiving. That's probably fine 
for normal-sized clusters, but when there are 3 nodes and one goes down, the 
effect is that the two remaining nodes transfer to each other at about half 
the rate they're capable of.



If there's something I'm missing, please let me know. If not, and this 
behavior is hard-coded, could I work around it by running multiple instances 
of "datanode" on each machine? Has anyone done this successfully?



Thanks

gregc
