Re: bandwidth (Was: Re: Running on multiple CPU's)

Doug Cutting Mon, 16 Apr 2007 10:49:43 -0700

jafarim wrote:

On linux and jvm6 with normal IDE disks and a giga ethernet switch with
corresponding NIC and with hadoop 0.9.11's HDFS. We wrote a C program by
using the native libs provided in the package but then we tested again with
distcp. The scenario was as follows:

We ran the test on a cluster with 1 node, then we added the nodes one by one
until reaching 5 nodes. Same test with samba saturated the link with only
one node.

How big were the files you were copying? The distcp task uses mapreduce to copy each file as a separate task. Each task launches in a new JVM, and the tasktrackers only poll for new tasks every few seconds. So, with smaller files it would not be able to saturate a gigabit switch. Ideally each file should take 10 seconds or more to copy. With a gigabit switch, this means a 1GB minimum filesize.
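
As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the link runs at its nominal 1 Gbit/s with no protocol or disk overhead, and the 3-second per-task startup cost is a hypothetical figure chosen only for illustration, not a measured Hadoop value.

```python
GIGABIT_PER_SEC = 1_000_000_000  # nominal line rate in bits per second


def min_file_size_bytes(link_bits_per_sec: float, target_seconds: float) -> float:
    """Bytes needed to keep the link busy for target_seconds at line rate."""
    return link_bits_per_sec / 8 * target_seconds


def link_utilization(file_bytes: float, link_bits_per_sec: float,
                     task_overhead_sec: float) -> float:
    """Fraction of a task's wall-clock time spent actually copying.

    task_overhead_sec stands in for JVM launch plus tasktracker polling
    delay; its value is an assumption, not something Hadoop reports.
    """
    copy_sec = file_bytes * 8 / link_bits_per_sec
    return copy_sec / (copy_sec + task_overhead_sec)


size = min_file_size_bytes(GIGABIT_PER_SEC, 10)
print(f"{size / 1e9:.2f} GB")  # prints "1.25 GB"

# Compare a 10 MB file with a 1.25 GB file under the assumed 3 s overhead.
for file_bytes in (10e6, 1.25e9):
    share = link_utilization(file_bytes, GIGABIT_PER_SEC, 3.0)
    print(f"{file_bytes / 1e6:.0f} MB file: {share:.0%} of task time spent copying")
```

At the nominal rate the figure comes out slightly above 1GB, which matches the round number above. Under these assumptions, small files spend almost all of each task's time on startup rather than on the copy itself.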


You could also try the single-threaded 'bin/hadoop dfs -put'.

A comparison with Samba is not entirely fair, since HDFS provides different features. For example, HDFS normally replicates data on three nodes, so writes consume twice or three times the bandwidth (depending on whether the source node is a datanode with space available).
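
The replication cost can be sketched the same way. This Python fragment assumes, as the paragraph above implies, that one replica is written locally when the source node is itself a datanode with space, so only the remaining replicas cross the network. The function name and parameters are hypothetical, not part of any Hadoop API.

```python
def bytes_on_network(file_bytes: int, replication: int = 3,
                     source_is_datanode: bool = False) -> int:
    """Total bytes sent over the network to write one file.

    Assumption: a source that is also a datanode with free space keeps
    one replica on its own disk, so that copy costs no network traffic.
    """
    remote_copies = replication - 1 if source_is_datanode else replication
    return file_bytes * remote_copies


one_gb = 1_000_000_000
print(bytes_on_network(one_gb, source_is_datanode=True))   # 2000000000
print(bytes_on_network(one_gb, source_is_datanode=False))  # 3000000000
```

A single-copy transfer such as Samba moves each byte once, so a like-for-like comparison would need to divide the HDFS network traffic by two or three.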

Finally, 0.9 is a pretty old release. Hadoop's performance and reliability have improved substantially in the past few months.


Doug
