Carl Cerecke wrote:
Hi,
I'm wondering what the best approach is to restrict a crawl to a certain
topic. I know that I can restrict what is crawled by a regex on the URL,
but I also need to restrict pages based on their content (whether they
are on topic or not).
For example, say I wanted to crawl pages about Antarctica. First I start
off with a handful of seed pages, inject them into the crawldb, generate a
fetchlist, and start sucking the pages down. I update the crawldb with the
links from what has just been sucked down, and then during the next fetch
(and subsequent fetches) I want to filter which pages end up in the segment
based on their content (using, perhaps, some sort of
antarctica-related-keyword score). Somehow I also need to tell the crawldb
about the URLs which I've sucked down but which aren't antarctica-related,
so we don't suck them down again.
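For the keyword score I'm imagining something very simple, roughly along
these lines (plain Java, made-up keyword list, just to illustrate the idea):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy topical scorer: counts topic keywords in the page text. */
public class TopicScorer {

  // Hypothetical keyword list; a real one would be larger and tuned.
  private static final List<String> KEYWORDS = Arrays.asList(
      "antarctica", "antarctic", "south pole", "ross ice shelf", "penguin");

  /** Returns the fraction of keywords that occur in the text (0.0 .. 1.0). */
  public static float score(String pageText) {
    if (pageText == null || pageText.length() == 0) {
      return 0.0f;
    }
    String text = pageText.toLowerCase(Locale.ENGLISH);
    int hits = 0;
    for (String kw : KEYWORDS) {
      if (text.contains(kw)) {
        hits++;
      }
    }
    return (float) hits / KEYWORDS.size();
  }

  public static void main(String[] args) {
    System.out.println(score("Emperor penguins breed on Antarctica near the Ross Ice Shelf."));
    System.out.println(score("A page about gardening in New Zealand."));
  }
}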
This seems like the sort of problem other people have solved already. Any
pointers? Am I on the right track here? I'm using Nutch 0.9.
The easiest way to do this is to implement a ScoringFilter plugin, which
promotes wanted pages and demotes unwanted ones. Please see the Javadoc
for ScoringFilter for details.
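Roughly, the idea is: compute a topical relevance score from the parsed
text, attach it to the page, and let the scoring hooks scale down the
scores of off-topic pages and of their outlinks, so they sink to the
bottom of future fetchlists while still being remembered in the crawldb.
A very rough sketch of the two hooks that matter most is below; the class
name is made up, TopicScorer is the hypothetical keyword counter sketched
earlier in the thread, and the exact method signatures (plus the rest of
the interface a complete plugin has to implement, and the plugin.xml /
plugin.includes registration) should be checked against the 0.9 Javadoc.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

/**
 * Sketch of the topic-specific part of such a plugin. A complete plugin
 * implements org.apache.nutch.scoring.ScoringFilter (all of its methods);
 * only the two hooks most relevant to topic filtering are shown here.
 */
public class AntarcticaScoring {

  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  /**
   * Runs after a page has been parsed: compute a topical score from the
   * extracted text and stash it in the parse metadata. The plugin's
   * score-distribution methods (not shown) would read it back and scale
   * down the scores passed on to this page's crawldb entry and outlinks.
   */
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    float topical = TopicScorer.score(parse.getText()); // keyword counter from the earlier sketch
    parse.getData().getParseMeta().set("topic.score", Float.toString(topical));
  }

  /**
   * Runs while a fetchlist is being generated: URLs whose stored score
   * has been demoted sort to the bottom, so with a -topN limit they stay
   * in the crawldb but stop showing up in fetchlists.
   */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    return datum.getScore() * initSort;
  }
}

Note that a page still has to be fetched once before its content can be
scored; what the demotion buys you is that its own re-fetches and its
outlinks drop out of later fetchlists, which also covers the "don't suck
them down again" part.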
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com