Andrzej Bialecki wrote:
> Carl Cerecke wrote:
>> Hi,
>>
>> I'm wondering what the best approach is to restrict a crawl to a
>> certain topic. I know that I can restrict what is crawled by a regex
>> on the URL, but I also need to restrict pages based on their content
>> (whether they are on topic or not).
>>
>> For example, say I wanted to crawl pages about Antarctica. First I
>> start off with a handful of pages and inject them into the crawldb,
>> generate a fetchlist, and start sucking the pages down. I update the
>> crawldb with links from what has just been fetched, and then during
>> the next fetch (and subsequent fetches), I want to filter which pages
>> end up in the segment based on their content (using, perhaps, some
>> sort of Antarctica-related-keyword score). Somehow I also need to
>> tell the crawldb about the URLs which I've fetched but which aren't
>> Antarctica-related pages (so we don't fetch them again).
>>
>> This seems like the sort of problem other people have solved. Any
>> pointers? Am I on the right track here? I'm using Nutch 0.9.
>
> The easiest way to do this is to implement a ScoringFilter plugin, which
> promotes wanted pages and demotes unwanted ones. Please see the Javadoc
> for ScoringFilter for details.
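The keyword-relevancy score mentioned above could be computed with something like the following. This is just a minimal, self-contained sketch of the scoring logic itself, outside of any Nutch plumbing; the keyword list, the weighting scheme, and the `scoreRelevance` name are all hypothetical, not anything Nutch provides:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RelevanceScorer {
    // Hypothetical topic vocabulary; a real filter would load this from
    // plugin configuration rather than hard-coding it.
    private static final List<String> KEYWORDS = Arrays.asList(
        "antarctica", "antarctic", "penguin", "ice shelf", "south pole");

    /** Fraction of topic keywords occurring in the page text, in [0, 1]. */
    public static float scoreRelevance(String pageText) {
        String text = pageText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String kw : KEYWORDS) {
            if (text.contains(kw)) {
                hits++;
            }
        }
        return (float) hits / KEYWORDS.size();
    }

    public static void main(String[] args) {
        // On-topic text matches 4 of the 5 keywords.
        System.out.println(scoreRelevance(
            "Penguins breed on the ice shelf in Antarctica."));
        // Off-topic text matches none.
        System.out.println(scoreRelevance("Stock market news for today."));
    }
}
```

Substring matching is crude (no stemming, no term frequency), but it is enough to demote clearly off-topic pages toward a score the generator will sort to the bottom.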
I've given this a crack and it mostly seems to work, except I'm not sure how to get the score back into the crawldb. After reading the Javadoc, I figured that passScoreAfterParsing() was the method I needed to implement; all the others are simple one-liners for this case. Unfortunately, passScoreAfterParsing() is alone in not having a CrawlDatum argument, so I can't call datum.setScore().

I did notice that OPICScoringFilter does this in passScoreAfterParsing():

    parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...);

I tried that in my own scoring filter, but I just get back the zero from datum.setScore(0.0f) in initialScore(). A couple of questions, then:

1. Does it make sense to put the relevancy scoring code into passScoreAfterParsing()?
2. If so, how do I get the score into the crawldb?

I'm a bit vague on how all these bits connect together under the hood at the moment...

Cheers,
Carl.
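For what it's worth, the pattern OPICScoringFilter seems to use is a two-step hand-off: passScoreAfterParsing() only stashes the score in the parse's content metadata under Nutch.SCORE_KEY, and the score reaches the crawldb later, when the outlink-distribution method reads it back and returns an adjustment CrawlDatum that the updatedb step merges into the page's own entry. The round trip can be sketched with a plain HashMap standing in for Nutch's Metadata class (everything here except the Nutch.SCORE_KEY idea is a hypothetical stand-in, not the real Nutch API):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the parse content metadata, which is a string-to-string map
// in Nutch (ParseData.getContentMeta()). SCORE_KEY mirrors Nutch.SCORE_KEY.
public class ScorePassing {
    static final String SCORE_KEY = "nutch.crawl.score";

    /** What passScoreAfterParsing() can do: stash the score as a string. */
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      float relevance) {
        contentMeta.put(SCORE_KEY, Float.toString(relevance));
    }

    /** What a later phase does: read the stashed score back, with a default
     *  for pages where no filter ever set one. */
    static float readScore(Map<String, String> contentMeta,
                           float defaultScore) {
        String s = contentMeta.get(SCORE_KEY);
        return s == null ? defaultScore : Float.parseFloat(s);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        passScoreAfterParsing(meta, 0.8f);
        System.out.println(readScore(meta, 0.0f)); // prints 0.8
    }
}
```

So the metadata set() call is only half the story; a filter that stops there will indeed see nothing but the initialScore() value in the crawldb, because nothing ever reads the key back out.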
