We've been trying to get this done -- check here for a start:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200702.mbox/[EMAIL PROTECTED]
On Jul 8, 2007, at 7:39 PM, Carl Cerecke wrote:
Hi,
I'm wondering what the best approach is to restrict a crawl to a
certain topic. I know that I can restrict what is crawled by a
regex on the URL, but I also need to restrict pages based on their
content (whether they are on topic or not).
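For reference: in Nutch 0.9 the URL-side restriction lives in
conf/regex-urlfilter.txt (or conf/crawl-urlfilter.txt when using the
one-shot crawl command). Patterns are tried top to bottom, first match
wins, + accepts, - rejects. A minimal sketch; the on-topic host
patterns are hypothetical placeholders:

    # skip image and other binary suffixes
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|css|js|pdf)$
    # accept only hosts believed to be on-topic (hypothetical domains)
    +^http://([a-z0-9-]+\.)*antarctica\.example\.org/
    +^http://([a-z0-9-]+\.)*polar\.example\.edu/
    # reject everything else
    -.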
For example, say I wanted to crawl pages about Antarctica. First I
start off with a handful of seed pages, inject them into the crawldb,
generate a fetchlist, and start fetching pages. I then update the
crawldb with the links from what was just fetched, and during the next
fetch (and subsequent fetches) I want to filter which pages end up in
the segment based on their content (using, perhaps, some sort of
Antarctica-related keyword score).
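A keyword-density score of that sort is straightforward to compute
from the parsed text. Below is a self-contained sketch in Java; the
keyword list and threshold are made-up placeholders that would need
tuning:

    import java.util.Arrays;
    import java.util.List;

    // Scores page text by the density of topic keywords.
    // Keywords and threshold here are illustrative assumptions.
    public class TopicScorer {
        private static final List<String> KEYWORDS = Arrays.asList(
            "antarctica", "antarctic", "south pole", "ross ice shelf");
        private static final double THRESHOLD = 0.001; // hits per word

        /** Returns true if the page text looks on-topic. */
        public static boolean onTopic(String pageText) {
            String text = pageText.toLowerCase();
            int hits = 0;
            for (String kw : KEYWORDS) {
                int idx = 0;
                while ((idx = text.indexOf(kw, idx)) != -1) {
                    hits++;
                    idx += kw.length();
                }
            }
            int words = text.split("\\s+").length;
            return words > 0 && (double) hits / words >= THRESHOLD;
        }

        public static void main(String[] args) {
            System.out.println(onTopic(
                "Emperor penguins breed on Antarctic sea ice.")); // true
        }
    }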
Somehow I also need to tell the crawldb about the URLs I've fetched
that aren't Antarctica-related, so we don't fetch them again.
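One thing worth noting: the crawldb already records each fetched page
with a next-fetch time, so an off-topic page won't be re-generated
until its fetch interval expires. The part that needs custom code is
keeping an off-topic page's outlinks out of the crawldb. In 0.9 that
can be done with an HtmlParseFilter plugin that strips the outlinks
from low-scoring pages before updatedb runs. A rough sketch, written
from memory of the 0.9 plugin API (constructor and method signatures
should be checked against the source tree); TopicScorer is the sketch
above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Drops outlinks from pages that don't look on-topic, so their
    // links never reach the crawldb during updatedb.
    public class TopicParseFilter implements HtmlParseFilter {
        private Configuration conf;

        public Parse filter(Content content, Parse parse,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
            if (TopicScorer.onTopic(parse.getText())) {
                return parse; // on-topic: leave untouched
            }
            ParseData old = parse.getData();
            // Keep the page record (so the crawldb remembers the fetch)
            // but hand no outlinks on to updatedb.
            ParseData stripped = new ParseData(old.getStatus(),
                old.getTitle(), new Outlink[0],
                old.getContentMeta(), old.getParseMeta());
            return new ParseImpl(parse.getText(), stripped);
        }

        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }
    }

The plugin still needs the usual plugin.xml/build.xml wiring, and its
id must be added to plugin.includes in nutch-site.xml before Nutch
will load it.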
This seems like the sort of problem other people have solved. Any
pointers? Am I on the right track here? I'm using Nutch 0.9.
Cheers,
Carl.