Bartosz Gadzimski wrote:
Andrzej Bialecki wrote:
Bartosz Gadzimski wrote:
As Arkadi said, your HDD is too slow for a 2 x quad-core processor. I
have the same problem and am now thinking of using more boxes or very
fast drives (15k SAS).
Raymond Balmés wrote:
Well, I suspect the sort function is single-threaded, as such functions
usually are, so only one core is used and 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps... I
assumed that some phases are single-threaded.
Folks,
From your conversation I suspect that you are running Hadoop with the
LocalJobTracker, i.e. in a single JVM - correct?
While this works ok for small datasets, you don't really benefit from
map-reduce parallelism (and you still pay the penalty for the
overheads). As your dataset grows, you will quickly reach the
scalability limits - in this case, the limit of IO throughput of a
single drive, during the sort phase of a large dataset. The excessive
IO demands can be solved by distributing the load (over many drives,
and over many machines), which is what HDFS is designed to do well.
Hadoop tasks are usually single-threaded, and additionally
LocalJobTracker implements only a primitive non-parallel model of task
execution - i.e. each task is scheduled to run sequentially in turn.
If you run the regular distributed JobTracker, Hadoop splits the load
among many tasks running in parallel.
So, the solution is this: set up a distributed Hadoop cluster, even if
it's going to consist of a single node - because then the data will be
split and processed in parallel by several JVM instances. This will
also help the operating system to schedule these processes over
multiple CPUs. Additionally, if you still experience IO contention,
consider moving to HDFS as the filesystem, and spread it over more
than 1 machine and more than 1 disk in each machine.
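In case it's useful, here is a minimal sketch of what that might look like
in conf/hadoop-site.xml for a single-node (pseudo-distributed) setup. The
hostnames, ports and replication value are placeholders, and the property
names assume a 0.x-era Hadoop, so double-check them against the docs for
your version:

  <configuration>
    <!-- use HDFS instead of the local filesystem (placeholder host/port) -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- run a real JobTracker instead of the local in-process job runner -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <!-- single node, so no point keeping 3 copies of each block -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

Then, roughly: format the namenode with "bin/hadoop namenode -format" and
start the daemons with "bin/start-all.sh" before submitting the Nutch jobs.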
Thanks Andrzej,
As in your other email, I was using VPSes (5 VPSes with 2GB RAM each for
Hadoop slaves) on a dual quad-core Xeon with 16GB of RAM and it doesn't
work. My I/O is killing everything, but it's a great configuration for
testing a Hadoop cluster in distributed mode.
But one question: does it make sense to use multi-core processors for
Hadoop slaves? If everything is about random disk I/O, then maybe I
should use single-core Pentium 4s instead of quad-core Xeons (which are
expensive).
Usually you will run several tasks on a single node, typically 3-5 map
tasks and 1-3 reduce tasks. Each task runs in a separate process (JVM),
and each may be scheduled to run on a different CPU.
So, if your map-reduce applications are CPU-bound, then you will greatly
benefit from multiple cores. If they are IO-bound, as most Nutch
processing is (except perhaps for parsing), then the benefit won't be as
big beyond 2 cores (the OS kernel usually executes on a single core, so
it's good to have a completely free spare CPU for running user apps).
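To make the slot counts concrete, they are set per TaskTracker in
hadoop-site.xml - a sketch, again assuming 0.x-era property names; tune
the values to your core and spindle count:

  <!-- how many map/reduce tasks one TaskTracker runs concurrently -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>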
Look at the top(1) output, and see what the CPU, disk IO, and swap
usage is during the worst load. Chances are you will see that the
map-reduce tasks consume relatively little CPU and that they are
primarily disk-bound (or network-bound). So instead of using quads,
perhaps it's better to put 2 disk drives in a striped config.
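If you do add a second drive, note that Hadoop can also spread its IO over
multiple disks without RAID: dfs.data.dir and mapred.local.dir accept
comma-separated lists of directories and the IO is spread across them. A
sketch, with placeholder mount points:

  <!-- HDFS block storage on both spindles (paths are placeholders) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hadoop/data,/disk2/hadoop/data</value>
  </property>
  <!-- intermediate map-reduce data (sort spills) on both spindles too -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/hadoop/local,/disk2/hadoop/local</value>
  </property>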
When configuring a Hadoop cluster there is a sweet spot somewhere between
the available disk IO, CPU and RAM per node, assuming a fixed budget for
the whole cluster. This sweet spot has usually been the previous year's
mid- to high-level hardware (not the top-level), which matches Google's
experience. For example, right now I would test how the following hardware
performs for you: a decent mobo fitted with a dual-core Intel @ 2GHz,
4-8GB of RAM and 2 SATA2 512GB drives, times N nodes (depending on the
size of your crawl).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com