Bwolen,

First of all, Hadoop is not optimized for small clusters or small bursts of writes/reads. There are some costs (like storing a copy locally and then copying it over the network) that don't have much benefit for small clusters.

You could try using different disks (not just different partitions) for the Maps' tmp directory and for the Datanode.
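
For example, something along these lines in hadoop-site.xml, assuming /disk1 and /disk2 are separate physical disks (the paths here are just placeholders; adjust them for your boxes):

   <!-- where map tasks write their intermediate/tmp data -->
   <property>
     <name>mapred.local.dir</name>
     <value>/disk1/hadoop/mapred/local</value>
   </property>

   <!-- where the Datanode stores its blocks -->
   <property>
     <name>dfs.data.dir</name>
     <value>/disk2/hadoop/dfs/data</value>
   </property>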

To compare a single-node write with Hadoop, you should run 'bin/hadoop dfs -copyFromLocal - test' and pipe your dd command output into it. Maybe you will see 25% of the 75MB you saw with the native write. That is not unexpected. Not sure if you want to know all the details of why it is so. In your test you also have many other one-time costs of starting and stopping jobs, etc.
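
Roughly something like this (just a sketch; I'm assuming the full shell form 'bin/hadoop dfs -copyFromLocal', and the bs/count values and the 'test' path are placeholders):

   # native write baseline on one machine
   dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024

   # same amount of data piped straight into DFS through the Hadoop shell
   dd if=/dev/zero bs=1M count=1024 | bin/hadoop dfs -copyFromLocal - test

Comparing those two numbers keeps the job startup/shutdown costs out of the picture.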

I don't mean to say Hadoop can't do better; its performance is steadily improving. But your expectations for a toy application might be off.

If you want to figure out what the problem could be, you could start with the 'copyFromLocal' example above. Here you need to figure out what the Datanode process and the Hadoop shell are doing at various times (maybe with stack traces).
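
One cheap way to get those stack traces (assuming a JDK with jps/jstack on the box; fill in <pid> by hand):

   # find the Datanode JVM
   jps | grep DataNode

   # thread dump goes to the datanode's .out file under logs/
   kill -QUIT <pid>

   # or, if jstack is available on your JDK
   jstack <pid> > datanode-stack.txt

Take a few dumps a second or two apart while the copy is running and see where the threads are sitting.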

Raghu.

Bwolen Yang wrote:
> Please try Hadoop 0.13.0.  I don't know whether it will address your
> concerns, but it should be faster and is much closer to what developers
> are currently working on.

ok. It would also be good to see how the DFS upgrade goes between versions.
(looks like it got released today.  cool.)


> For such a small cluster you'd probably be better off running the jobtracker
> and namenode on the same node and gain another slave.

When the namenode and jobtracker were running on the same machine, I
noticed failures due to losing contact with the jobtracker.  This is why I
split them onto separate machines.

With regard to the performance details, they are really independent of
how many slaves I have.  The test is mainly trying to see how closely
Hadoop compares to a single node or scp, and what tuning parameters
would make things run faster.

Any suggestions on Java profiling tools?

bwolen
