You can limit the size of each segment with the crawler's -topN option,
which caps the number of URLs per segment. You will then have to run
multiple crawl cycles to fetch all 40,000 URLs.
Note that any new outlinks your documents produce are added to the
crawldb after each cycle; the order in which they are fetched is
determined by the scoring plugin(s).
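
A rough sketch of what such a cycle looks like (the paths and the -topN
value of 1000 are just placeholders; generate creates a new timestamped
segment directory each time):

    bin/nutch inject crawl/crawldb urls
    # repeat until the crawldb has no more unfetched URLs:
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s

Or, if you use the one-step crawl tool, -depth and -topN together bound
how much is fetched per run, e.g.:

    bin/nutch crawl urls -dir crawl -depth 10 -topN 1000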

Btw, if you use a local filesystem you might be able to recover some of
the fetched data; see:
http://issues.apache.org/jira/browse/NUTCH-451

Mathijs

Harmesh, V2solutions wrote:
> hi all,
>        I had run a crawl of approximately 40,000 URLs. It stopped partway
> through with a "no disk space available" error. Is there any way to restrict
> the size of segments so that only a few MB go into a particular segment?
> Thanks in advance.
>   
