Carl Cerecke wrote:
Hi,
I'm wondering what the best approach is to restrict a crawl to a certain
topic. I know that I can restrict what is crawled by a regex on the URL,
but I also need to restrict pages based on their content (whether they
are on topic or not).
For example, say I wanted to crawl pages about Antarctica. First I start
off with a handful of seed pages, inject them into the crawldb, generate a
fetchlist, and start sucking the pages down. I update the crawldb with the
links from what has just been sucked down, and then during the next fetch
(and subsequent fetches) I want to filter which pages end up in the segment
based on their content (using, perhaps, some sort of
antarctica-related-keyword score). Somehow I also need to tell the crawldb
about the URLs which I've sucked down but which aren't antarctica-related,
so we don't suck them down again.
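For the keyword score I'm imagining something very simple, roughly along
these lines (plain Java, made-up keyword list, just to illustrate the idea):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy topical scorer: counts topic keywords in the page text. */
public class TopicScorer {

  // Hypothetical keyword list; a real one would be larger and tuned.
  private static final List<String> KEYWORDS = Arrays.asList(
      "antarctica", "antarctic", "south pole", "ross ice shelf", "penguin");

  /** Returns the fraction of keywords that occur in the text (0.0 .. 1.0). */
  public static float score(String pageText) {
    if (pageText == null || pageText.length() == 0) {
      return 0.0f;
    }
    String text = pageText.toLowerCase(Locale.ENGLISH);
    int hits = 0;
    for (String kw : KEYWORDS) {
      if (text.contains(kw)) {
        hits++;
      }
    }
    return (float) hits / KEYWORDS.size();
  }

  public static void main(String[] args) {
    System.out.println(score("Emperor penguins breed on Antarctica near the Ross Ice Shelf."));
    System.out.println(score("A page about gardening in New Zealand."));
  }
}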
This seems like the sort of problem other people have solved already. Any
pointers? Am I on the right track here? I'm using Nutch 0.9.
The easiest way to do this is to implement a ScoringFilter plugin, which
promotes wanted pages and demotes unwanted ones. Please see the Javadoc
for ScoringFilter for details.
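Roughly, the idea is: compute a topical relevance score from the parsed
text, attach it to the page, and let the scoring hooks scale down the
scores of off-topic pages and of their outlinks, so they sink to the
bottom of future fetchlists while still being remembered in the crawldb.
A very rough sketch of the two hooks that matter most is below; the class
name is made up, TopicScorer is the hypothetical keyword counter sketched
earlier in the thread, and the exact method signatures (plus the rest of
the interface a complete plugin has to implement, and the plugin.xml /
plugin.includes registration) should be checked against the 0.9 Javadoc.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

/**
 * Sketch of the topic-specific part of such a plugin. A complete plugin
 * implements org.apache.nutch.scoring.ScoringFilter (all of its methods);
 * only the two hooks most relevant to topic filtering are shown here.
 */
public class AntarcticaScoring {

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  /**
   * Runs after a page has been parsed: compute a topical score from the
   * extracted text and stash it in the parse metadata. The plugin's
   * score-distribution methods (not shown) would read it back and scale
   * down the scores passed on to this page's crawldb entry and outlinks.
   */
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    float topical = TopicScorer.score(parse.getText()); // keyword counter from the earlier sketch
    parse.getData().getParseMeta().set("topic.score", Float.toString(topical));
  }

  /**
   * Runs while a fetchlist is being generated: URLs whose stored score
   * has been demoted sort to the bottom, so with a -topN limit they stay
   * in the crawldb but stop showing up in fetchlists.
   */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    return datum.getScore() * initSort;
  }
}

Note that a page still has to be fetched once before its content can be
scored; what the demotion buys you is that its own re-fetches and its
outlinks drop out of later fetchlists, which also covers the "don't suck
them down again" part.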
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com