Hi John, I had the same issue before but never found a solution. Here is a workaround someone mentioned on this mailing list; you might give it a try:
"Seemingly abnormal temp space use by segment merger"
http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569

Regards,
Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <j...@beforedawnsolutions.com> wrote:
> Ok.
>
> So, an update on this item.
>
> I did start running Nutch with Hadoop; I am trying a single-node config
> just to test it out.
>
> It took forever to get all of the files into the DFS (it was just over
> 80 GB), but it is in there. So I started the SegmentMerge job, and it is
> working flawlessly, though still a little slow.
>
> Also, looking at the stats for the CPU, it sometimes goes over 20%, but
> not by much and not often. The disk is very lightly taxed; the peak was
> about 20 MB/sec, and the drives and interface are rated at 3 Gb/sec, so
> no issue there.
>
> I tried to set the map tasks to 7 and the reduce tasks to 3, but when I
> restarted everything it is still only using 2 and 1. Any ideas? I made
> that change in the hadoop-site.xml file, BTW.
>
> -John
>
>
> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>
>> John Martyniak wrote:
>>> Andrzej,
>>> I am a little embarrassed asking, but is there a setup guide for
>>> setting up Hadoop for Nutch 1.0, or is it the same process as setting
>>> up for Nutch 0.17 (which I think is the existing guide out there)?
>>
>> Basically, yes - but this guide is primarily about setting up a Hadoop
>> cluster using the Hadoop pieces distributed with Nutch. As such, these
>> instructions are already slightly outdated, so it's best simply to
>> install a clean Hadoop 0.19.1 according to the instructions on the
>> Hadoop wiki, and then build the nutch*.job file separately.
>>
>>> Also, I have Hadoop already running for some other applications not
>>> associated with Nutch; can I use the same install? I think that it is
>>> the same version that Nutch 1.0 uses. Or is it just easier to set it
>>> up using the nutch config?
>>
>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of
>> the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would
>> strongly recommend this way, instead of the usual "dirty" way of
>> setting up Nutch by replicating the local build dir ;)
>>
>> Just specify the nutch*.job file like this:
>>
>>   bin/hadoop jar nutch*.job <className> <args ...>
>>
>> where className and args are the class name and arguments of one of the
>> Nutch command-line tools. You can also modify the bin/nutch script
>> slightly, so that you don't have to specify fully-qualified class
>> names.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
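On John's question above about the task counts stuck at 2 and 1: in Hadoop 0.19, the number of tasks a single tasktracker runs concurrently is capped by the per-node slot maxima, not only by the job-level task counts, and the tasktracker has to be restarted for the maxima to take effect. The sketch below shows the properties that typically govern this in hadoop-site.xml; the values 7 and 3 just mirror what John tried and are not a tuning recommendation.

```xml
<!-- hadoop-site.xml: illustrative values, assuming Hadoop 0.19 property names -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>  <!-- concurrent map slots per tasktracker (default 2) -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>  <!-- concurrent reduce slots per tasktracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>  <!-- a hint only; the input split count can override it -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>  <!-- reduces per job (default 1, matching the "1" John saw) -->
</property>
```

After editing the file, restart the tasktracker (e.g. via stop-mapred.sh / start-mapred.sh) so the slot maxima are re-read.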
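Andrzej's `bin/hadoop jar nutch*.job <className> <args ...>` pattern might look like the following in practice. The HDFS paths here are made up for illustration, and the SegmentMerger argument layout is from memory, so check it with `bin/nutch mergesegs` before relying on it.

```shell
# Inject seed URLs into the crawldb on an existing Hadoop cluster,
# using the Nutch job file instead of a replicated build dir.
bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Injector \
    crawl/crawldb urls

# Merge several segments into one output directory (the task this
# thread is about); paths are illustrative.
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segments -dir crawl/segments
```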