Hi, how do people use Nutch to crawl continuously? Do you use the "recrawl" script from the Wiki and start it via a cron job? I'd prefer a process that runs forever and keeps the index mostly up-to-date, something like the loop sketched below.
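To make this concrete, here is the kind of loop I have in mind, shelling out to bin/nutch. The command names (generate/fetch/updatedb/index), the crawl/db and crawl/segments paths, the -topN value and the one-hour pause are placeholders I made up for illustration, so please read it as a sketch rather than a working setup:

import java.io.File;
import java.util.Arrays;

/** Sketch of a forever-running recrawl driver: it repeats the usual
 *  generate/fetch/updatedb/index cycle and pauses between rounds. */
public class ContinuousRecrawl {

    /** Runs one external command and fails loudly on a non-zero exit. */
    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0)
            throw new RuntimeException("command failed: " + Arrays.toString(cmd));
    }

    /** Segment directories are named by timestamp, so the lexically
     *  largest one is the segment that generate just created. */
    private static String newestSegment(String dir) {
        File[] segments = new File(dir).listFiles();
        Arrays.sort(segments);
        return segments[segments.length - 1].getPath();
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            run("bin/nutch", "generate", "crawl/db", "crawl/segments", "-topN", "1000");
            String segment = newestSegment("crawl/segments");
            run("bin/nutch", "fetch", segment);
            run("bin/nutch", "updatedb", "crawl/db", segment);
            run("bin/nutch", "index", segment);
            Thread.sleep(60 * 60 * 1000L); // wait an hour before the next round
        }
    }
}

Unlike independent cron runs, the next cycle here only starts once the previous one has finished, so cycles can never overlap.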
In my case, I'm not trying to index the complete web but only interesting sites. Whether a site is interesting will be decided during crawling, by a plugin I'm planning to write. Does anybody have experience with this kind of use case? To my understanding, I'll need to modify the Generator class so that it completely ignores a page (and its links) if my plugin considers that page irrelevant, roughly as sketched below.
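The RelevancePlugin interface and the SelectiveGenerator class here are just names I made up to show the idea, not an existing Nutch extension point; the real Generator selects from the crawl database, so the actual change would go into its selection code:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical extension point: decides whether a page is worth crawling. */
interface RelevancePlugin {
    boolean isRelevant(String url);
}

/** Sketch of the changed selection step: a page the plugin rejects never
 *  enters the fetch list, so it is never fetched and its outlinks never
 *  reach the crawl database. */
class SelectiveGenerator {

    private final RelevancePlugin plugin;

    SelectiveGenerator(RelevancePlugin plugin) {
        this.plugin = plugin;
    }

    List<String> select(List<String> candidates, int topN) {
        List<String> fetchList = new ArrayList<String>();
        for (String url : candidates) {
            if (fetchList.size() >= topN)
                break;
            if (plugin.isRelevant(url)) {
                fetchList.add(url);
            }
            // a rejected page is dropped completely: since it is never
            // fetched, its links are never discovered either
        }
        return fetchList;
    }
}

If the decision could be made from the URL string alone, the existing URLFilter extension point (org.apache.nutch.net.URLFilter, if I'm not mistaken) would probably be enough without touching the Generator; I suspect that won't do here, which is why I'm looking at the Generator itself.

Regards
 Daniel

--
http://www.danielnaber.de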
