Don't change options in nutch-default.xml - copy the option into nutch-site.xml and change it there. That way the change will (hopefully) survive an upgrade.
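For example, overriding the property discussed below in conf/nutch-site.xml (rather than editing nutch-default.xml) would look something like this sketch; it follows the standard Hadoop-style configuration format that Nutch uses:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides. Values here take precedence
     over nutch-default.xml, so they survive an upgrade. -->
<configuration>
  <property>
    <name>generate.update.db</name>
    <value>true</value>
    <description>Mark generated URLs in the crawldb so that successive
    generate runs produce non-overlapping fetchlists.</description>
  </property>
</configuration>
```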
On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel <gaurangtpa...@gmail.com> wrote:
> Hey,
>
> Never mind. I found *generate.update.db* in *nutch-default.xml* and set it
> to true.
>
> Regards,
> Gaurang
>
> 2009/10/5 Gaurang Patel <gaurangtpa...@gmail.com>
>
>> Hey Andrzej,
>>
>> Can you tell me where to set this property (generate.update.db)? I am
>> trying to run a crawl scenario similar to the one Eric is running.
>>
>> -Gaurang
>>
>> 2009/10/5 Andrzej Bialecki <a...@getopt.org>
>>
>> Eric wrote:
>>>
>>>> Andrzej,
>>>>
>>>> Just to make sure I have this straight: set the generate.update.db
>>>> property to true, then run
>>>>
>>>> bin/nutch generate crawl/crawldb crawl/segments -topN 100000
>>>>
>>>> 16 times?
>>>
>>> Yes. When this property is set to true, each fetchlist will be
>>> different, because the records for pages that are already on another
>>> fetchlist will be temporarily locked. Please note that this lock holds
>>> only for 1 week, so you need to fetch all segments within one week of
>>> generating them.
>>>
>>> You can fetch and updatedb in arbitrary order, so once you have fetched
>>> some segments you can run parsing and updatedb on just those segments,
>>> without waiting for all 16 segments to be processed.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  || |   Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com

--
http://www.linkedin.com/in/paultomblin
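The workflow in the thread above can be sketched as a short script. This is only an illustration under the assumptions in the thread: a standard Nutch layout with crawl/crawldb and crawl/segments, generate.update.db already set to true, and Eric's -topN value; paths and counts will differ for your install.

```shell
#!/bin/sh
# Sketch of the generate/fetch cycle discussed above.
# Assumes generate.update.db=true in conf/nutch-site.xml, so each
# generate run produces a fetchlist that does not overlap the others.

# Generate 16 segments of up to 100000 URLs each (Eric's parameters).
for i in $(seq 1 16); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done

# Segments can be processed in any order, but each must be fetched
# within one week of generation (the generate.update.db lock expires).
for seg in crawl/segments/*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
  bin/nutch updatedb crawl/crawldb "$seg"
done
```

As Andrzej notes, the second loop need not wait for the first to finish all 16 segments: each segment can be fetched, parsed, and merged into the crawldb independently.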