When I set generate.update.db to true and then run generate, it only produces URLs twice: 100K for the 1st generate, 62.5K for the 2nd, and 0 for the 3rd, on a seed list of 1.6M. I don't understand this. With a topN of 100K, it should run 16 times and create 16 distinct lists, if I am not mistaken.

Eric


On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:

Hey,

Never mind. I found *generate.update.db* in *nutch-default.xml* and set it
to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel <gaurangtpa...@gmail.com>

Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am
trying to run a crawl scenario similar to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki <a...@getopt.org>

Eric wrote:

Andrzej,

Just to make sure I have this straight: set the generate.update.db
property to true, then run

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

16 times?


Yes. When this property is set to true, each fetchlist will be
different, because the records for pages that are already on another
fetchlist will be temporarily locked. Please note that this lock holds
only for 1 week, so you need to fetch all segments within one week of
generating them.

You can fetch and updatedb in arbitrary order, so once you have fetched
some segments you can run the parsing and updatedb on just those
segments, without waiting for all 16 segments to be processed.
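
The workflow described above could be scripted roughly as follows. This is only a sketch under some assumptions: generate.update.db is already set to true, the 1.6M seed count and the crawl/ paths are the ones mentioned in this thread, and the names NUTCH, generate_all and process_segment are made up for illustration. NUTCH defaults to a dry run that just prints each command. The fetch step is assumed to run with -noParsing so that a separate parse step makes sense; drop the parse line if your fetcher already parses.

```shell
#!/bin/sh
# Sketch only. By default this is a dry run that prints each command;
# set NUTCH=bin/nutch in the environment to execute for real.
NUTCH="${NUTCH:-echo bin/nutch}"
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
TOTAL_URLS=1600000   # seed list size mentioned in the thread
TOPN=100000

# Number of fetchlists needed to cover every URL (ceiling division).
ROUNDS=$(( (TOTAL_URLS + TOPN - 1) / TOPN ))

# Generate all fetchlists up front. With generate.update.db=true, URLs
# already placed on a fetchlist are locked, so each list should differ.
generate_all() {
  i=1
  while [ "$i" -le "$ROUNDS" ]; do
    $NUTCH generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
    i=$((i + 1))
  done
}

# Fetch, parse and update the crawldb from a single segment ($1),
# without waiting for the other segments.
process_segment() {
  $NUTCH fetch "$1" -noParsing   # assumption: parsing is done separately
  $NUTCH parse "$1"
  $NUTCH updatedb "$CRAWLDB" "$1"
}
```

Calling generate_all once and then process_segment on each directory under crawl/segments, in any order, matches the scheme above. All segments still need to be fetched within the one-week lock period.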



--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Eric Osgood
---------------------------------------------
Cal Poly - Computer Engineering, Moon Valley Software
---------------------------------------------
eosg...@calpoly.edu, e...@lakemeadonline.com
---------------------------------------------
www.calpoly.edu/~eosgood, www.lakemeadonline.com
