We've been trying to get this done -- check here for a start:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200702.mbox/[EMAIL PROTECTED]
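
In the meantime, here's a rough sketch of the content-scoring half of the
problem. It's plain Java and deliberately independent of the Nutch plugin
API; the keyword list and threshold are invented for illustration. Inside
Nutch you could hook something like this into a parse filter and drop
pages that score below the threshold, recording them in the crawldb so
they aren't fetched again:

    import java.util.Arrays;
    import java.util.List;

    /**
     * Toy topic-relevance scorer: counts occurrences of topic keywords
     * in the page text and normalizes by word count. Pages scoring
     * below THRESHOLD would be kept out of the segment and marked in
     * the crawldb so they are not generated again.
     */
    public class TopicScorer {

        // Hypothetical keyword list; in practice, load it from a
        // config file or derive it from the seed pages.
        private static final List<String> KEYWORDS = Arrays.asList(
                "antarctica", "antarctic", "south pole", "ross ice shelf");

        private static final double THRESHOLD = 0.002; // tune empirically

        /** Keyword hits per word of page text. */
        public static double score(String pageText) {
            String text = pageText.toLowerCase();
            int words = text.split("\\s+").length;
            int hits = 0;
            for (String kw : KEYWORDS) {
                int from = 0;
                while ((from = text.indexOf(kw, from)) != -1) {
                    hits++;
                    from += kw.length();
                }
            }
            return words == 0 ? 0.0 : (double) hits / words;
        }

        public static boolean onTopic(String pageText) {
            return score(pageText) >= THRESHOLD;
        }

        public static void main(String[] args) {
            String sample = "Emperor penguins breed on the sea ice of Antarctica.";
            System.out.printf("score=%.4f onTopic=%b%n",
                    score(sample), onTopic(sample));
        }
    }

For the crawldb side, one option is to leave the rejected URLs in the db
but flag them so that generate skips them, rather than deleting them
outright.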


On Jul 8, 2007, at 7:39 PM, Carl Cerecke wrote:

> Hi,
>
> I'm wondering what the best approach is to restrict a crawl to a  
> certain topic. I know that I can restrict what is crawled by a  
> regex on the URL, but I also need to restrict pages based on their  
> content (whether they are on topic or not).
>
> For example, say I wanted to crawl pages about Antarctica. First I
> inject a handful of seed pages into the crawldb, generate a
> fetchlist, and start sucking the pages down. I update the crawldb
> with links from what has just been fetched, and then during the next
> fetch (and subsequent fetches) I want to filter which pages end up in
> the segment based on their content (using, perhaps, some sort of
> Antarctica-related keyword score). Somehow I also need to tell the
> crawldb about the URLs I've fetched that turned out not to be about
> Antarctica, so they aren't fetched again.
>
> This seems like the sort of problem other people have solved. Any
> pointers? Am I on the right track here? I'm using Nutch 0.9.
>
> Cheers,
> Carl.
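
For reference, the inject/generate/fetch/updatedb loop described above
maps onto the stock 0.9 commands (the paths here are just examples):

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    bin/nutch fetch crawl/segments/<segment>
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

Any content-based filtering has to happen between fetch and updatedb, so
that off-topic pages contribute neither their text to the index nor
their outlinks to the next fetchlist.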

