Bwolen,
First of all, Hadoop is not optimized for small clusters or for small bursts
of writes/reads. There are some fixed costs (such as storing a copy of the
data locally and the extra copying that involves) that don't pay off on a
small cluster.
You could try using different physical disks (not just separate partitions)
for the Maps' tmp directory and for the Datanode's data directory.
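For example, something along these lines (the /disk1 and /disk2 paths are
just placeholders for two separate physical disks):

    # Placeholders: assume /disk1 and /disk2 are separate physical disks.
    # In conf/hadoop-site.xml, point dfs.data.dir (Datanode block storage)
    # at the first and mapred.local.dir (Map tmp/spill space) at the second.
    mkdir -p /disk1/dfs/data /disk2/mapred/local
    df /disk1/dfs/data /disk2/mapred/local   # confirm these are different devices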
To compare a single-node write with Hadoop, you should run 'bin/hadoop dfs
-copyFromLocal - test' and pipe your dd command's output into it. Maybe you
will see 25% of the 75 MB/s you saw with the native write; that is not
unexpected.
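Concretely, the comparison could look something like this (block size and
count are just placeholders; 'test' is the HDFS destination path):

    # local baseline: raw sequential write to a single disk
    dd if=/dev/zero of=/tmp/dd.out bs=1M count=1024
    # same amount of data written into HDFS from the same machine
    dd if=/dev/zero bs=1M count=1024 | bin/hadoop dfs -copyFromLocal - test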
Not sure if you want to know all the details of why that is. In your test
you also have many other one-time costs of starting and stopping jobs, etc.
I don't mean to say Hadoop can't do better; its performance is steadily
improving. But your expectations for a toy application might be off.
If you want to figure out where the problem could be, you could start
with the 'copyFromLocal' example above. There you need to figure out what
the Datanode process and the Hadoop shell are doing at various times (maybe
with stack traces).
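One quick way to get stack traces without a full profiler (the pid below is
a placeholder):

    jps -l             # list running Java processes; note the Datanode pid
    kill -QUIT 12345   # replace 12345 with that pid; the JVM writes a thread
                       # dump to the Datanode's .out file under logs/
    # the same trick works on the client-side 'bin/hadoop dfs' process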
Raghu.
Bwolen Yang wrote:
> Please try Hadoop 0.13.0. I don't know whether it will address your
> concerns, but it should be faster and is much closer to what developers
> are currently working on.
ok. It would also be good to see how the DFS upgrade goes between versions.
(looks like it got released today. cool.)
> For such a small cluster you'd probably be better off running the
> jobtracker and namenode on the same node and gain another slave.
When the namenode and jobtracker were running on the same machine, I
noticed failures due to losing contact with the jobtracker. This is why I
split them onto separate machines.
With regard to the performance details, they are really independent of
how many slaves I have. The test is mainly trying to see how Hadoop
compares to a single node or to scp, and what tuning parameters would
make things run faster.
Any suggestions on java profiling tools?
bwolen