Kai_testing Middleton wrote:
I am running a Nutch crawl of 19 sites. I would like to let this crawl run for
about two days and then gracefully stop it (I don't expect it to complete by
then). Is there a way to do this? I want it to stop crawling and then build the
Lucene index. Note that I used a simple nutch crawl command rather than the
"whole web" crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10
I use an iterative approach, with a script similar to the one Sami blogs
about here:
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
I then issue a crawl of 10,000 URLs at a time and simply repeat the
process for as long as the crawl window lasts. Because I use Solr to
store the crawl results, the index stays available during the crawl
window.
I'm a relative newbie as well, though, so I look forward to what the experts say.
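The iterative cycle could be sketched roughly like the script below. This is a
minimal sketch, not Sami's actual script: the directory layout, the -topN
value, and the stop-file check are my assumptions, and the Solr indexing step
depends on the patch described in Sami's post.

```shell
#!/bin/sh
# Sketch of an iterative Nutch crawl loop. Paths, the -topN value,
# and the stop-file convention are illustrative assumptions.
CRAWL_DIR=/usr/tmp/19sites
CRAWLDB=$CRAWL_DIR/crawldb
SEGMENTS=$CRAWL_DIR/segments

# Seed the crawl database once from the URL list.
bin/nutch inject $CRAWLDB urls.txt

# Run one generate/fetch/updatedb cycle per iteration until a stop
# file appears, so the crawl halts gracefully between cycles.
while [ ! -f "$CRAWL_DIR/stop" ]; do
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT
  # Index the finished segment into Solr here (e.g. via the patch
  # from Sami's post) so results are searchable mid-crawl.
done
```

With a loop like this, "gracefully stop after two days" reduces to creating the
stop file (e.g. `touch /usr/tmp/19sites/stop` from cron): the current cycle
finishes and the loop exits, leaving the crawldb and index consistent.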
regards
Ian