I'm testing some scenarios where a very small cluster needs to replicate a lot of data after a node goes down. I'm seeing pretty slow performance: each node appears to send only one or two blocks at a time, with long gaps where the network interface sits idle for 1-2 seconds before the next block copy starts.
I've looked around for how to tune this, but none of the following settings seem to increase the thread count or throughput:

dfs.replication.interval (tried 3, 1)
dfs.namenode.handler.count (tried 10, 30)
dfs.datanode.handler.count (tried 25)
dfs.replication.considerLoad (tried true and false)

Nothing changes the behavior. It looks like each node with data to send opens a connection to every other node capable of receiving. That's probably fine for normal-sized clusters, but when there are 3 nodes and one goes down, the effect is that the two remaining nodes transfer to each other at about half the rate they're capable of.

If there's something I'm missing, please let me know. If not, and this behavior is hard-coded, could I work around it by running multiple "datanode" instances on each machine? Has anyone done this successfully?

Thanks,
gregc
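For what it's worth, here is a sketch of the replication-throttle properties I'd try next in hdfs-site.xml. This assumes a Hadoop version recent enough to support them (2.x or later); the values shown are guesses to experiment with, not recommendations:

```xml
<configuration>
  <!-- Max concurrent outbound replication streams the NameNode will
       schedule per DataNode (default 2, which would match the
       "one or two blocks at a time" symptom). -->
  <property>
    <name>dfs.namenode.replication.max-streams</name>
    <value>10</value>
  </property>
  <!-- Hard upper bound on streams per DataNode, including
       highest-priority re-replication work (default 4). -->
  <property>
    <name>dfs.namenode.replication.max-streams-hard-limit</name>
    <value>20</value>
  </property>
  <!-- Blocks scheduled for replication per heartbeat interval is
       roughly this multiplier times the number of live nodes
       (default 2), which could explain the idle gaps between copies. -->
  <property>
    <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
    <value>10</value>
  </property>
</configuration>
```

If your version predates these properties, that might explain why the handler-count settings above had no effect, since those control RPC threads rather than replication scheduling.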