Hi,

I'm wondering what the best approach is to restrict a crawl to a certain topic. I know that I can restrict what is crawled by a regex on the URL, but I also need to restrict pages based on their content (whether they are on topic or not).
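(For what it's worth, on the URL side I'm handling this with the regex URL filter, i.e. entries in conf/regex-urlfilter.txt along these lines; the patterns below are just made up for illustration:

# only follow URLs that look on-topic (illustrative pattern)
+antarctic

# reject everything else
-.

That part works fine; it's the content side I'm stuck on.)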

For example, say I wanted to crawl pages about Antarctica. I'd start off with a handful of seed pages, inject them into the crawldb, generate a fetchlist, and start fetching. I'd then update the crawldb with the links from what was just fetched. During the next fetch (and subsequent fetches), I want to filter which pages end up in the segment based on their content (using, perhaps, some sort of Antarctica-related-keyword score). Somehow I also need to tell the crawldb about the URLs I've fetched that turn out not to be Antarctica-related, so we don't fetch them again.
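The kind of content check I have in mind is roughly the sketch below -- the keyword list and threshold are placeholders, and I haven't tied it into any Nutch extension point yet:

import java.util.HashSet;
import java.util.Set;

/**
 * Rough sketch of a topic scorer: count how many Antarctica-related
 * keywords occur in the parsed page text and turn that into a score.
 * Keywords and threshold are just placeholders for illustration.
 */
public class TopicScorer {

  private static final Set<String> KEYWORDS = new HashSet<String>();
  static {
    KEYWORDS.add("antarctica");
    KEYWORDS.add("antarctic");
    KEYWORDS.add("south pole");
    KEYWORDS.add("ross ice shelf");
    KEYWORDS.add("penguin");
  }

  /** Fraction of keywords that occur at least once in the page text. */
  public static float score(String pageText) {
    if (pageText == null || pageText.length() == 0) {
      return 0.0f;
    }
    String text = pageText.toLowerCase();
    int hits = 0;
    for (String keyword : KEYWORDS) {
      if (text.indexOf(keyword) >= 0) {
        hits++;
      }
    }
    return (float) hits / KEYWORDS.size();
  }

  /** Pages below this (arbitrary) threshold would be treated as off topic. */
  public static boolean isOnTopic(String pageText) {
    return score(pageText) >= 0.2f;
  }
}

My guess is something like this would get called from a parse-time plugin (HtmlParseFilter?) or a ScoringFilter, with the result used both to decide whether to keep the parse and to mark the URL in the crawldb as not worth refetching -- but I'm not sure which extension point is the right one, hence the question.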

This seems like the sort of problem other people have solved already. Any pointers? Am I on the right track here? I'm using Nutch 0.9.

Cheers,
Carl.