You could write a shell script, run from a cron job every minute or so,
that checks the size of the temp file directory and kills the Java
process once the size goes over your limit. Alternatively, if you are
comfortable programming, you could build the check into the crawler code
itself in Java.
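
Something along these lines might work for the cron approach. It is just
a rough, untested sketch: the crawl directory, the 500 MB limit, the grep
pattern used to find the crawl process, and the script path are all
placeholders you would need to adjust for your own setup.

  #!/bin/sh
  # Rough sketch: stop the crawl once its output directory grows past a limit.
  # CRAWL_DIR and LIMIT_MB are placeholders; point them at your own setup.
  CRAWL_DIR=/path/to/crawl
  LIMIT_MB=500

  # du -sm reports the directory size in megabytes
  SIZE_MB=`du -sm "$CRAWL_DIR" | awk '{print $1}'`

  if [ "$SIZE_MB" -gt "$LIMIT_MB" ]; then
      # The grep pattern is a guess; match it to however you launch the crawl.
      PID=`ps ax | grep 'org.apache.nutch' | grep -v grep | awk '{print $1}'`
      [ -n "$PID" ] && kill $PID
  fi

with a crontab entry such as:

  * * * * * /path/to/check_crawl_size.sh

Killing the process like this is crude (the segment being fetched at that
moment will most likely be left incomplete), but it does keep the disk
usage bounded.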

As far as I know, there is no setting in the configuration files that
will do this for you.

Regards,
Alexander

-----Original Message-----
From: Olena Medelyan [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 21 March 2006 3:46 PM
To: [email protected]
Subject: How to terminate the crawl?

Hi,

I'm using the crawl tool in Nutch to crawl the web starting from a set of
URL seeds. The crawl normally finishes once the specified depth has been
reached. Is it possible to terminate it after a pre-defined number of
pages, or after text data of a pre-defined size (e.g. 500 MB), has been
crawled?
Thank you for any hints!

Regards,
Olena




