jafarim wrote:
On linux and jvm6 with normal IDE disks and a giga ethernet switch with
corresponding NIC and with hadoop 0.9.11's HDFS. We wrote a C program by
using the native libs provided in the package but then we tested again with
distcp. The scenario was as follows:
We ran the test on a cluster with 1 node, then we added the nodes one by
one until reaching 5 nodes. Same test with samba saturated the link with only
one node.
How big were the files you were copying? The distcp task uses mapreduce
to copy each file as a separate task. Each task launches in a new JVM,
and the tasktrackers only poll for new tasks every few seconds. So,
with small files distcp cannot saturate a gigabit switch.
Ideally each file should take 10 seconds or more to copy. With a
gigabit switch, this means roughly a 1GB minimum file size.
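As a rough check on that figure, here is a minimal sketch of the arithmetic. The function names and the 5-second per-task startup overhead are illustrative assumptions, not measurements from any cluster; the model also ignores disk speed and protocol overhead.

```python
def min_file_size_bytes(link_bits_per_sec, target_seconds):
    """Bytes a link can move in target_seconds at full line rate."""
    return link_bits_per_sec / 8 * target_seconds


def effective_throughput(file_bytes, link_bits_per_sec, overhead_seconds):
    """Average bytes/sec for one copy task, including fixed startup overhead.

    overhead_seconds stands in for JVM launch plus tasktracker polling delay;
    its value is an assumption, not a measured number.
    """
    transfer_seconds = file_bytes / (link_bits_per_sec / 8)
    return file_bytes / (overhead_seconds + transfer_seconds)


GIGABIT = 1e9  # bits per second

# File size that takes 10 seconds to copy at gigabit line rate, in GB.
print(min_file_size_bytes(GIGABIT, 10) / 1e9)  # 1.25

# A 10 MB file with an assumed 5 s startup cost uses only a small
# fraction of the 125 MB/s the link could carry.
small = effective_throughput(10e6, GIGABIT, 5)
print(small / (GIGABIT / 8))
```

The 1.25 GB result is the theoretical ceiling for a gigabit link, which is consistent with the rough 1GB figure above.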
You could also try the single-threaded 'bin/hadoop dfs -put', which
avoids the MapReduce task-startup overhead.
A comparison with Samba is not entirely fair, since HDFS provides
different features. For example, HDFS normally replicates data on three
nodes, so writes consume twice or three times the bandwidth (depending
on whether the source node is a datanode with space available).
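The replication cost can be sketched the same way. This is a simplified model under stated assumptions: the function name is hypothetical, and it assumes that when the writer is a datanode with free space the first replica is written locally, so only the remaining replicas cross the network.

```python
def network_bytes_written(file_bytes, replication=3, source_is_datanode=False):
    """Bytes sent over the network to store one file in HDFS.

    Assumes one replica stays on the local disk when the writing node is
    itself a datanode with space available; otherwise every replica
    travels over the network.
    """
    local_replicas = 1 if (source_is_datanode and replication > 0) else 0
    return file_bytes * (replication - local_replicas)


one_gb = 10**9
print(network_bytes_written(one_gb, 3, source_is_datanode=True) // one_gb)   # 2
print(network_bytes_written(one_gb, 3, source_is_datanode=False) // one_gb)  # 3
```

A Samba copy moves each byte across the network once, so for the same file HDFS with three replicas puts two or three times as much traffic on the switch.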
Finally, 0.9 is a pretty old release. Hadoop's performance and
reliability have improved substantially in the past few months.
Doug