Re: how to restrict the size of segments

Mathijs Homminga Tue, 13 Mar 2007 04:59:53 -0800

You can limit the size of each segment by using the crawler's -topN option. This caps the number of URLs per segment, so it bounds the segment's size on disk only indirectly.

You will then have to run multiple crawl cycles to fetch all of your 40k URLs.
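
As a rough illustration of the arithmetic, here is a small Python sketch that estimates how many cycles a given -topN implies and lists the per-cycle commands. The -topN value of 5000 and the crawl/crawldb and crawl/segments paths are assumptions chosen for the example, not values from this thread, and <segment> stands for whatever timestamped directory the generate step creates. The result is a lower bound, since newly discovered outlinks add URLs to the crawldb.

```python
import math


def cycles_needed(total_urls, top_n):
    """Minimum number of generate/fetch/updatedb cycles to cover total_urls."""
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    return math.ceil(total_urls / top_n)


def cycle_commands(crawldb, segments_dir, top_n):
    """Commands for one cycle; <segment> is a placeholder for the new segment dir."""
    return [
        f"bin/nutch generate {crawldb} {segments_dir} -topN {top_n}",
        "bin/nutch fetch <segment>",
        f"bin/nutch updatedb {crawldb} <segment>",
    ]


if __name__ == "__main__":
    # Assumed example values: 40k URLs, 5000 URLs per segment.
    print(cycles_needed(40000, 5000), "cycles at minimum")
    for cmd in cycle_commands("crawl/crawldb", "crawl/segments", 5000):
        print(cmd)
```

A smaller -topN gives smaller segments but more cycles, so you can pick the value from the disk space you have available per segment.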

Note that if your documents produce new outlinks, they are put into the crawldb after each cycle. The order in which they are fetched is determined by the scoring plugin(s).

Btw, if you use a local filesystem you might be able to recover some of the fetched data, see:

http://issues.apache.org/jira/browse/NUTCH-451

Mathijs

Harmesh, V2solutions wrote:

Hi all,
       I ran a crawl of approximately 40,000 URLs. It stopped partway through,
giving an error of no disk available. Is there any way to restrict the size
of segments so that only a few MB go into a particular segment?
Thanks in advance.
