Andrzej Bialecki writes:
Bartosz Gadzimski wrote:
As Arkadi said, your HDD is too slow for a 2 x quad-core processor setup. I
have the same problem and am now thinking of using more boxes or very
fast drives (15k SAS).
Raymond Balmès writes:
Well, I suspect the sort function is mono-threaded, as they usually are,
so only one core is used and 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps ... I
assumed that some phases are mono-threaded.
Folks,
From your conversation I suspect that you are running Hadoop with the
LocalJobTracker, i.e. in a single JVM - correct?
While this works ok for small datasets, you don't really benefit from
map-reduce parallelism (and you still pay the penalty for the
overheads). As your dataset grows, you will quickly reach the
scalability limits - in this case, the limit of IO throughput of a
single drive, during the sort phase of a large dataset. The excessive
IO demands can be solved by distributing the load (over many drives,
and over many machines), which is what HDFS is designed to do well.
Hadoop tasks are usually single-threaded, and additionally the
LocalJobTracker implements only a primitive, non-parallel model of task
execution - i.e. tasks are scheduled to run sequentially, one at a time.
If you run the regular distributed JobTracker, Hadoop splits the load
among many tasks running in parallel.
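As a quick check, which of the two you are running is determined by the
mapred.job.tracker property. A minimal sketch, assuming the classic
hadoop-site.xml layout (newer releases split the configuration into
core-/hdfs-/mapred-site.xml):

  <property>
    <name>mapred.job.tracker</name>
    <!-- "local" means the single-JVM, sequential mode described above;
         a host:port pair points at a real, distributed JobTracker -->
    <value>local</value>
  </property>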
So, the solution is this: set up a distributed Hadoop cluster, even if
it's going to consist of a single node - because then the data will be
split and processed in parallel by several JVM instances. This will
also help the operating system schedule these processes across
multiple CPUs. Additionally, if you still experience IO contention,
consider moving to HDFS as the filesystem, and spread it over more
than one machine and more than one disk in each machine.
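To make that concrete, here is a minimal single-node, distributed-mode
sketch, again assuming the classic hadoop-site.xml layout; the hostnames,
ports, directory paths and slot counts are placeholders to adjust for
your hardware:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <!-- use HDFS instead of the local filesystem -->
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <!-- a real JobTracker, even on a single box -->
      <value>localhost:9001</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <!-- comma-separated list, one directory per physical disk -->
      <value>/disk1/dfs/data,/disk2/dfs/data</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <!-- intermediate map output and sort spills, also spread over disks -->
      <value>/disk1/mapred/local,/disk2/mapred/local</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <!-- roughly one map slot per core -->
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
  </configuration>

With dfs.data.dir and mapred.local.dir pointing at directories on
different physical disks, both the HDFS blocks and the intermediate sort
spills get spread across spindles instead of hammering a single drive.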
Thanks Andrzej,
As in your other email: I was using VPSes (5 VPSes with 2GB RAM each as
Hadoop slaves) on a dual quad-core Xeon with 16GB of RAM, and it doesn't
work - my I/O is killing everything. But it's a great configuration for
testing a Hadoop cluster in distributed mode.
But one question: does it make sense to use multi-core processors for
Hadoop slaves? If everything comes down to random disk I/O, then maybe I
should use single-core Pentium 4s instead of quad-core Xeons (which are
expensive).
Thanks,
Bartosz