Thanks for all of the input. I was leaning towards setting up a Hadoop cluster for this, as the data set is getting quite large and creating indexes etc. is taking longer and longer.

My other option would be to set up several Virtual Private Servers across the two boxes and then run the Hadoop cluster on all of the VPSes, so in effect I could create 4, 6, or 8 nodes running on two physical boxes. Has anyone tried something like this? Would it reduce the amount of disk contention, or would it make no difference, so that it is better just to have a two-node cluster?

Thanks again for all of the help.

-John


On Jun 4, 2009, at 7:47 AM, Andrzej Bialecki wrote:

Bartosz Gadzimski wrote:
As Arkadi said, your hdd is to slow for 2 x quad core processor. I have the same problem and now thinking of using more boxes or very fast drives (sas 15k).
Raymond Balmès writes:
Well, I suspect the sort function is single-threaded, as they usually are, so
with only one core in use, 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps ... I
assumed that some phases are single-threaded.

Folks,

From your conversation I suspect that you are running Hadoop with the LocalJobTracker, i.e. in a single JVM - correct?

While this works fine for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty of its overheads). As your dataset grows, you will quickly hit the scalability limits - in this case, the IO throughput limit of a single drive during the sort phase of a large dataset. The excessive IO demands can be addressed by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.

Hadoop tasks are usually single-threaded, and additionally the LocalJobTracker implements only a primitive non-parallel model of task execution - each task is scheduled to run sequentially, in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.

So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
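For reference, a pseudo-distributed setup like the above might look like the sketch below - a single node, but with real JobTracker/TaskTracker daemons instead of the LocalJobTracker. This is only an assumed example: the property names are from the hadoop-site.xml style of configuration current at the time, and the localhost ports and slot counts are placeholder values you would adapt to your own cluster:

```xml
<!-- hadoop-site.xml: minimal pseudo-distributed sketch (ports and values are assumptions) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- use HDFS rather than the local filesystem -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- anything other than "local" means a real JobTracker,
         so tasks run in parallel in separate child JVMs -->
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <!-- e.g. one map slot per core on a 2 x quad-core box -->
    <value>8</value>
  </property>
</configuration>
```

The key point is that mapred.job.tracker is not left at its default of "local": with a real JobTracker address, each task runs in its own child JVM, which lets the OS spread the work over all cores, and fs.default.name pointing at HDFS lets you later add machines and disks to relieve IO contention.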

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com
