We've been trying to get this done -- check here for a start: http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200702.mbox/[EMAIL PROTECTED]
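In the meantime, here is a very rough standalone sketch of the keyword-score idea. The class and method names are made up for illustration -- this is not the Nutch 0.9 plugin API -- but the same logic would presumably sit inside a parse-filter or scoring plugin, which is also where you'd decide what to do with the off-topic URLs so they aren't fetched again:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: score a page's text against a list of topic
    // keywords. Not a Nutch plugin; just the scoring logic on its own.
    public class TopicScorer {

        private final Set<String> topicTerms;

        public TopicScorer(String... terms) {
            this.topicTerms = new HashSet<String>();
            for (String t : terms) {
                topicTerms.add(t.toLowerCase());
            }
        }

        // Fraction of tokens in the page text that are topic keywords.
        public float score(String pageText) {
            String[] tokens = pageText.toLowerCase().split("\\W+");
            if (tokens.length == 0) {
                return 0f;
            }
            int hits = 0;
            for (String tok : tokens) {
                if (topicTerms.contains(tok)) {
                    hits++;
                }
            }
            return (float) hits / tokens.length;
        }

        // Pages scoring below the threshold are treated as off-topic.
        public boolean onTopic(String pageText, float threshold) {
            return score(pageText) >= threshold;
        }

        public static void main(String[] args) {
            TopicScorer scorer = new TopicScorer(
                "antarctica", "antarctic", "penguin", "ross", "ice", "shelf");
            String sample =
                "The Ross Ice Shelf is the largest ice shelf of Antarctica.";
            System.out.println("score = " + scorer.score(sample));
            System.out.println("on topic? " + scorer.onTopic(sample, 0.05f));
        }
    }

The threshold is something you'd have to tune against your seed pages; a plain keyword-density score like this is crude, but it's a cheap first cut before trying anything like a trained classifier.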
On Jul 8, 2007, at 7:39 PM, Carl Cerecke wrote:

> Hi,
>
> I'm wondering what the best approach is to restrict a crawl to a
> certain topic. I know that I can restrict what is crawled by a
> regex on the URL, but I also need to restrict pages based on their
> content (whether they are on topic or not).
>
> For example, say I wanted to crawl pages about Antarctica. First I
> start off with a handful of pages and inject them into the crawldb,
> and I generate a fetchlist, and can start sucking the pages down. I
> update the crawldb with links from what has just been sucked down,
> and then during the next fetch (and subsequent fetches), I want to
> filter which pages end up in the segment based on their content
> (using, perhaps, some sort of antarctica-related-keyword score).
> Somehow I also need to tell the crawldb about the URLs which I've
> sucked down but aren't antarctica-related pages (so we don't suck
> them down again).
>
> This seems like the sort of problem other people have solved. Any
> pointers? Am I on the right track here? Using nutch 0.9
>
> Cheers,
> Carl.
