Hi, how do people use Nutch to crawl continuously? Do you use the "recrawl" script from the Wiki and start it via a cron job? I'd prefer a process that runs forever and keeps the index mostly up-to-date, something like the loop sketched below.
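To make this concrete, here is the kind of loop I have in mind, shelling out to bin/nutch. The command names (generate/fetch/updatedb/index), the crawl/db and crawl/segments paths, the -topN value and the one-hour pause are placeholders I made up for illustration, so please read it as a sketch rather than a working setup:

import java.io.File;
import java.util.Arrays;

/** Sketch of a forever-running recrawl driver: it repeats the usual
 *  generate/fetch/updatedb/index cycle and pauses between rounds. */
public class ContinuousRecrawl {

    /** Runs one external command and fails loudly on a non-zero exit. */
    private static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0)
            throw new RuntimeException("command failed: " + Arrays.toString(cmd));
    }

    /** Segment directories are named by timestamp, so the lexically
     *  largest one is the segment that generate just created. */
    private static String newestSegment(String dir) {
        File[] segments = new File(dir).listFiles();
        Arrays.sort(segments);
        return segments[segments.length - 1].getPath();
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            run("bin/nutch", "generate", "crawl/db", "crawl/segments", "-topN", "1000");
            String segment = newestSegment("crawl/segments");
            run("bin/nutch", "fetch", segment);
            run("bin/nutch", "updatedb", "crawl/db", segment);
            run("bin/nutch", "index", segment);
            Thread.sleep(60 * 60 * 1000L); // wait an hour before the next round
        }
    }
}

Unlike independent cron runs, the next cycle here only starts once the previous one has finished, so cycles can never overlap.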
In my case, I'm not trying to index the complete web but only interesting sites. Whether a site is interesting will be decided during crawling, by a plugin I'm planning to write. Does anybody have experience with this kind of use case? To my understanding, I'll need to modify the Generator class so that it completely ignores a page (and its links) if my plugin considers that page irrelevant, roughly as sketched below.
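The RelevancePlugin interface and the SelectiveGenerator class here are just names I made up to show the idea, not an existing Nutch extension point; the real Generator selects from the crawl database, so the actual change would go into its selection code:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical extension point: decides whether a page is worth crawling. */
interface RelevancePlugin {
    boolean isRelevant(String url);
}

/** Sketch of the changed selection step: a page the plugin rejects never
 *  enters the fetch list, so it is never fetched and its outlinks never
 *  reach the crawl database. */
class SelectiveGenerator {

    private final RelevancePlugin plugin;

    SelectiveGenerator(RelevancePlugin plugin) {
        this.plugin = plugin;
    }

    List<String> select(List<String> candidates, int topN) {
        List<String> fetchList = new ArrayList<String>();
        for (String url : candidates) {
            if (fetchList.size() >= topN)
                break;
            if (plugin.isRelevant(url)) {
                fetchList.add(url);
            }
            // a rejected page is dropped completely: since it is never
            // fetched, its links are never discovered either
        }
        return fetchList;
    }
}

If the decision could be made from the URL string alone, the existing URLFilter extension point (org.apache.nutch.net.URLFilter, if I'm not mistaken) would probably be enough without touching the Generator; I suspect that won't do here, which is why I'm looking at the Generator itself.

Regards
 Daniel

--
http://www.danielnaber.de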
