Carl Cerecke wrote:
> Hi,
> 
> I'm wondering what the best approach is to restrict a crawl to a certain 
> topic. I know that I can restrict what is crawled by a regex on the URL, 
> but I also need to restrict pages based on their content (whether they 
> are on topic or not).
> 
> For example, say I wanted to crawl pages about Antarctica. First I start 
> off with a handful of pages, inject them into the crawldb, generate a 
> fetchlist, and start sucking the pages down. I update the crawldb with 
> links from what has just been sucked down, and then during the next 
> fetch (and subsequent fetches), I want to filter which pages end up in 
> the segment based on their content (using, perhaps, some sort of 
> antarctica-related-keyword score). Somehow I also need to tell the 
> crawldb about the URLs I've sucked down that aren't antarctica-related 
> pages (so we don't suck them down again).
> 
> This seems like the sort of problem other people have solved. Any 
> pointers? Am I on the right track here? I'm using Nutch 0.9.

The easiest way to do this is to implement a ScoringFilter plugin, which 
promotes wanted pages and demotes unwanted ones. Please see the Javadoc 
for ScoringFilter for details.
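
Something along these lines, as a rough sketch only: the method 
signatures follow the 0.9 ScoringFilter interface as I remember it, so 
verify them against the Javadoc for your release. The package name, the 
keyword list, and the "topic.score" metadata key are all made-up 
placeholders.

  package org.example.scoring;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.scoring.ScoringFilter;
  import org.apache.nutch.scoring.ScoringFilterException;

  public class TopicScoringFilter implements ScoringFilter {

    // Hypothetical topic keywords; in practice load them from config.
    private static final String[] KEYWORDS =
        { "antarctica", "antarctic", "south pole", "ross ice shelf" };

    private Configuration conf;

    public void setConf(Configuration conf) { this.conf = conf; }

    public Configuration getConf() { return conf; }

    // Runs after parsing: count topic keywords in the parsed text and
    // stash the result in the parse metadata, where later phases
    // (updatedb, outlink score distribution) can read it back.
    public void passScoreAfterParsing(Text url, Content content,
        Parse parse) throws ScoringFilterException {
      String text = parse.getText().toLowerCase();
      int hits = 0;
      for (String kw : KEYWORDS) {
        int idx = text.indexOf(kw);
        while (idx != -1) {
          hits++;
          idx = text.indexOf(kw, idx + kw.length());
        }
      }
      parse.getData().getContentMeta()
          .set("topic.score", Integer.toString(hits));
    }

    // Generate-time sort value: URLs whose crawldb score has been
    // demoted sink to the bottom of the fetchlist, so with -topN they
    // are effectively never fetched again.
    public float generatorSortValue(Text url, CrawlDatum datum,
        float initSort) throws ScoringFilterException {
      return initSort * datum.getScore();
    }

    // The remaining interface methods (injectedScore, initialScore,
    // updateDbScore, distributeScoreToOutlink, indexerScore) are
    // omitted here; implement them as pass-throughs, and use the
    // stored "topic.score" in the outlink/updatedb phases to promote
    // on-topic pages and demote off-topic ones.
  }

You'd package this as a plugin, declare the ScoringFilter extension 
point in its plugin.xml, and add the plugin id to plugin.includes in 
nutch-site.xml. Note also that pages which are fetched but score poorly 
are still recorded as fetched in the crawldb, so they won't be 
generated again until their refetch interval expires -- the score 
mainly controls how their outlinks are prioritized.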


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

