I'm glad you got the slowness issue straightened out.
 
When you import the DMOZ URLs into your Nutch DB, the "-subset" option isn't 
really meant to limit the size of your fetch lists, and that becomes even more 
true once you start re-fetching. You can skip the subset step entirely and let 
all of the URLs in, unless you have your own custom filtering 
method/requirement.
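
For example, with the 0.8-style tutorial layout (the crawl/crawldb and dmoz 
paths below are just the usual tutorial placeholders, adjust them to your own 
setup), importing everything would look something like:

mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
bin/nutch inject crawl/crawldb dmoz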
 
You should use the "-topN" option instead when you generate your segment. This 
creates a segment with at most that number of URLs. Below are examples of 
creating a segment with 1 million URLs to fetch, one for each Nutch version:
 
(Nutch 0.7) bin/nutch generate db segments -topN 1000000

(Nutch 0.8+) bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
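
Then, for each round of fetching, you would run something like the following 
(again 0.8-style commands, sketched from the tutorial; the ls/tail trick just 
picks up the newest segment directory):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
s=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s

After updatedb the fetched pages are marked in the crawldb, so the next 
generate will pick up the next million unfetched URLs, and you can repeat the 
loop until the whole DMOZ list is done.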
 
----- Original Message -----
From: shrinivas patwardhan <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Tuesday, January 2, 2007 4:25:13 AM
Subject: Re: fetcher : some doubts


Thank you, Sean Dean.
That sounds good, I will try it out.
Tell me if I am right: in the case of a DMOZ index file injected into the db, I
generate only a few segments by using -subset and then fetch them, and then go
on and generate the next set of segments. I hope I am heading the right way.
And about the previous problem of the search being slow: it wasn't my hardware,
my segments were corrupt. I fixed them and the search runs fine now.

Thanks & Regards
Shrinivas Patwardhan
