Andrzej Bialecki wrote:
Carl Cerecke wrote:
Hi,

I'm wondering what the best approach is to restrict a crawl to a certain topic. I know that I can restrict what is crawled by a regex on the URL, but I also need to restrict pages based on their content (whether they are on topic or not).

For example, say I want to crawl pages about Antarctica. I start off with a handful of seed pages, inject them into the crawldb, generate a fetchlist, and start sucking the pages down. I then update the crawldb with links from what has just been fetched, and during the next fetch (and subsequent fetches) I want to filter which pages end up in the segment based on their content (using, perhaps, some sort of Antarctica-related-keyword score). Somehow I also need to tell the crawldb about the URLs I've sucked down that turned out not to be Antarctica-related, so we don't fetch them again.

This seems like the sort of problem other people have solved. Any pointers? Am I on the right track here? I'm using Nutch 0.9.

The easiest way to do this is to implement a ScoringFilter plugin, which promotes wanted pages and demotes unwanted ones. Please see the Javadoc for ScoringFilter for details.
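To make the keyword-score idea concrete, here is a minimal, self-contained sketch of the relevance calculation such a filter might perform. The class and method names are hypothetical (this is not part of the Nutch API); something like this could compute the topical score that a custom ScoringFilter assigns to a fetched page's parsed text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Nutch class: scores a page as the
// fraction of its tokens that match a topic keyword list.
class TopicScorer {
    private final Set<String> keywords;

    TopicScorer(String... topicWords) {
        keywords = new HashSet<String>(Arrays.asList(topicWords));
    }

    // Returns keyword hits divided by total tokens (0.0 for empty text).
    float score(String pageText) {
        String[] tokens = pageText.toLowerCase().split("\\W+");
        int hits = 0, total = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            total++;
            if (keywords.contains(t)) hits++;
        }
        return total == 0 ? 0.0f : (float) hits / total;
    }

    public static void main(String[] args) {
        TopicScorer scorer =
            new TopicScorer("antarctica", "penguin", "ice", "glacier");
        System.out.println(scorer.score("Antarctica is covered in ice"));
        System.out.println(scorer.score("cooking pasta at home"));
    }
}
```

In practice you would likely want stemming and a weighted keyword list rather than a flat token match, but the shape is the same: parse text in, a float in [0, 1] out, which the filter can then use to promote or demote the page.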

I've given this a crack and it mostly seems to work, except I'm not sure how to get the score back into the crawldb. After reading the Javadoc, I figured that passScoreAfterParsing() was the method I need to implement; all the others are simple one-liners for this case. Unfortunately, passScoreAfterParsing() is the one method without a CrawlDatum argument, so I can't call datum.setScore(). I did notice that OPICScoringFilter does this in passScoreAfterParsing(): parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...); and I tried that in my own scoring filter, but the crawldb still just shows the zero I set with datum.setScore(0.0f) in initialScore().

A couple of questions, then:
1. Does it make sense to put the relevancy-scoring code into passScoreAfterParsing()?
2. If so, how do I get the score into the crawldb?

I'm still a bit vague on how all these bits connect together under the hood at the moment...

Cheers,
Carl.
