Thanks for all of the input. I was leaning towards setting up a Hadoop cluster for this, as the data set is getting quite large and creating indexes etc. is taking longer and longer.

My other option would be to set up several Virtual Private Servers across the two boxes and then run the Hadoop cluster on all of the VPSes, so in effect I could create 4, 6, or 8 nodes running on two physical boxes. Has anyone tried something like this? Would it reduce the amount of disk contention, or would it make no difference, so that it is better just to have a two-node cluster?

Thanks again for all of the help.

-John


On Jun 4, 2009, at 7:47 AM, Andrzej Bialecki wrote:

Bartosz Gadzimski wrote:
As Arkadi said, your hdd is to slow for 2 x quad core processor. I have the same problem and now thinking of using more boxes or very fast drives (sas 15k).
Raymond Balmès writes:
Well, I suspect the sort function is single-threaded, as they usually are, so
with only one core in use, 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps ... I
assumed that some phases are single-threaded.

Folks,

From your conversation I suspect that you are running Hadoop with the LocalJobTracker, i.e. in a single JVM - correct?

While this works fine for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty of its overheads). As your dataset grows, you will quickly hit the scalability limits - in this case, the IO throughput limit of a single drive during the sort phase of a large dataset. The excessive IO demands can be addressed by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.

Hadoop tasks are usually single-threaded, and additionally the LocalJobTracker implements only a primitive non-parallel model of task execution - each task is scheduled to run sequentially, in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.

So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than one machine and more than one disk in each machine.
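For reference, a pseudo-distributed setup like the above might look like the sketch below - a single node, but with real JobTracker/TaskTracker daemons instead of the LocalJobTracker. This is only an assumed example: the property names are from the hadoop-site.xml style of configuration current at the time, and the localhost ports and slot counts are placeholder values you would adapt to your own cluster:

```xml
<!-- hadoop-site.xml: minimal pseudo-distributed sketch (ports and values are assumptions) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- use HDFS rather than the local filesystem -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- anything other than "local" means a real JobTracker,
         so tasks run in parallel in separate child JVMs -->
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <!-- e.g. one map slot per core on a 2 x quad-core box -->
    <value>8</value>
  </property>
</configuration>
```

The key point is that mapred.job.tracker is not left at its default of "local": with a real JobTracker address, each task runs in its own child JVM, which lets the OS spread the work over all cores, and fs.default.name pointing at HDFS lets you later add machines and disks to relieve IO contention.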

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: j...@beforedawnsoutions.com
w: http://www.beforedawnsolutions.com
