Carl Cerecke wrote:
Andrzej Bialecki wrote:
Carl Cerecke wrote:
Hi,
I'm wondering what the best approach is to restrict a crawl to a
certain topic. I know that I can restrict what is crawled by a regex
on the URL, but I also need to restrict pages based on their content
(whether they are on topic or not).
For example, say I wanted to crawl pages about Antarctica. First I
start off with a handful of pages and inject them into the crawldb,
and I generate a fetchlist, and can start sucking the pages down. I
update the crawldb with links from what has just been sucked down,
and then during the next fetch (and subsequent fetches), I want to
filter which pages end up in the segment based on their content
(using, perhaps, some sort of Antarctica-related keyword score).
Somehow I also need to tell the crawldb about the URLs which I've
sucked down but aren't Antarctica-related pages (so we don't suck
them down again).
This seems like the sort of problem other people have solved. Any
pointers? Am I on the right track here? I'm using Nutch 0.9.
The easiest way to do this is to implement a ScoringFilter plugin,
which promotes wanted pages and demotes unwanted ones. Please see
Javadoc for the ScoringFilter for details.
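To make "promotes wanted pages and demotes unwanted ones" concrete, here is a minimal sketch of the relevance calculation such a plugin might wrap. It is plain Java and deliberately does not touch any Nutch classes; the class name, keyword list, threshold, and multipliers are all hypothetical values chosen for illustration, and wiring the result into the ScoringFilter methods is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: not part of Nutch. All names and constants are assumptions.
public class TopicScoreSketch {
    // Hypothetical topic keywords for the Antarctica example.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "antarctica", "antarctic", "penguin", "glacier", "icebreaker"));

    static final float THRESHOLD = 0.05f; // assumed minimum keyword fraction
    static final float BOOST = 2.0f;      // assumed multiplier for on-topic pages
    static final float PENALTY = 0.1f;    // assumed multiplier for off-topic pages

    /** Returns the fraction of word tokens in the text that are topic keywords. */
    static float relevance(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (KEYWORDS.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0f : (float) hits / total;
    }

    /** Promotes a page's score when it looks on topic, demotes it otherwise. */
    static float adjust(float currentScore, float relevance) {
        return relevance >= THRESHOLD ? currentScore * BOOST : currentScore * PENALTY;
    }
}
```

A plugin could call relevance() on the parsed text and feed the result to adjust(); a plain keyword fraction is only one possible measure, and a real crawl would probably want stemming and a tuned threshold.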
I've given this a crack and it mostly seems to work, except I'm not sure
how to get the score back into the crawldb. After reading the Javadoc, I
figured that passScoreAfterParsing() was the method I need to implement.
All others are just simple one-liners for this case. Unfortunately,
passScoreAfterParsing() is alone in not having a CrawlDatum argument, so
I can't call datum.setScore(). I did notice that OPICScoringFilter does
this in passScoreAfterParsing():
parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...);
and I tried that in my own scoring filter, but I just get the zero from
datum.setScore(0.0f) in initialScore().
Couple of questions then:
1. Does it make sense to put the relevancy scoring code into
passScoreAfterParsing()?
2. If so, how do I get the score into the crawldb?
I'm a bit vague on how all these bits connect together under the hood at
the moment...
Spent all day on this, but no luck. I'm sure I'm missing something
obvious. Glad for any pointers in the right direction.
Cheers,
Carl.