Kai_testing Middleton wrote:
I am running a Nutch crawl of 19 sites. I would like to let this crawl run for
about two days and then gracefully stop it (I don't expect it to complete by
then). Is there a way to do this? I want it to stop crawling and then build the
Lucene index. Note that I used a simple nutch crawl command rather than the
"whole web" crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10
I use an iterative approach, with a script similar to the one Sami blogs
about here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
I then issue a crawl of 10,000 URLs at a time and simply repeat the
process for as long as the crawl window lasts. Because I use Solr to
store the crawl results, the index stays available during the crawl
window.
I'm a relative newbie as well, though, so I look forward to what the experts say.
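The iterative cycle could be sketched roughly like the script below. This is a
minimal sketch, not Sami's actual script: the directory layout, the -topN
value, and the stop-file check are my assumptions, and the Solr indexing step
depends on the patch described in Sami's post.

```shell
#!/bin/sh
# Sketch of an iterative Nutch crawl loop. Paths, the -topN value,
# and the stop-file convention are illustrative assumptions.
CRAWL_DIR=/usr/tmp/19sites
CRAWLDB=$CRAWL_DIR/crawldb
SEGMENTS=$CRAWL_DIR/segments

# Seed the crawl database once from the URL list.
bin/nutch inject $CRAWLDB urls.txt

# Run one generate/fetch/updatedb cycle per iteration until a stop
# file appears, so the crawl halts gracefully between cycles.
while [ ! -f "$CRAWL_DIR/stop" ]; do
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  # Index the finished segment into Solr here (e.g. via the patch
  # from Sami's post) so results are searchable mid-crawl.
done
```

With a loop like this, "gracefully stop after two days" reduces to creating the
stop file (e.g. `touch /usr/tmp/19sites/stop` from cron): the current cycle
finishes and the loop exits, leaving the crawldb and index consistent.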
regards
Ian