jafarim wrote:
On Linux with JVM 6, ordinary IDE disks, and a gigabit Ethernet switch with matching NICs, using HDFS from Hadoop 0.9.11. We wrote a C program using the native libs provided in the package, and then tested again with distcp. The scenario was as follows: we ran the test on a cluster with one node, then added nodes one by one until we reached five nodes. The same test with Samba saturated the link with only one node.

How big were the files you were copying? distcp uses MapReduce to copy each file as a separate task. Each task launches in a new JVM, and the tasktrackers only poll for new tasks every few seconds, so with small files it will not be able to saturate a gigabit switch. Ideally each file should take ten seconds or more to copy; at gigabit speeds (roughly 125 MB/s), that works out to a minimum file size of about 1GB.
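If you want to measure raw write throughput without the per-task JVM overhead, something like the following Java sketch writes a 1GB file through the FileSystem API and reports MB/s. The path and size are just illustrative, not prescriptive:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteBench {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[64 * 1024];            // 64 KB write buffer
        long total = 1024L * 1024 * 1024;            // 1 GB test file
        long start = System.currentTimeMillis();
        // /bench/test.dat is a placeholder path
        FSDataOutputStream out = fs.create(new Path("/bench/test.dat"));
        for (long written = 0; written < total; written += buf.length) {
          out.write(buf);
        }
        out.close();
        long ms = System.currentTimeMillis() - start;
        System.out.println("MB/s: " +
            (total / 1024.0 / 1024.0) / (ms / 1000.0));
      }
    }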

You could also try the single-threaded 'bin/hadoop dfs -put'.
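For example, to copy a single large file (both paths here are placeholders):

    bin/hadoop dfs -put /local/path/bigfile /bench/bigfile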

A comparison with Samba is not entirely fair, since HDFS provides different features. For example, HDFS normally replicates data on three nodes, so writes consume two or three times the network bandwidth (depending on whether the source node is itself a datanode with space available, in which case one replica is written locally).
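If you want a closer apples-to-apples comparison, you could set the default replication to 1 for the test, e.g. in your site config. This sacrifices HDFS's fault tolerance, so it is for benchmarking only:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>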

Finally, 0.9 is a pretty old release. Hadoop's performance and reliability have improved substantially in the past few months.

Doug
