Hi,

I did focused crawling with Nutch a few months ago. What I did was override some methods of the scoring-opic plugin, the ones that run before and after parsing, just as Krugler said. I customized the scoring metadata, and I even managed to integrate a text classifier (a Bayesian classifier) to automatically classify web pages. But I didn't get good precision/recall, maybe because my training dataset was small.

In the end, I just wrote a customized scoring algorithm based on heuristic rules for my topic. It works quite well.

You can use either a classifier or a customized algorithm for topic-based crawling. It depends on how much training data you have and what topic you are crawling. For a text classifier, you can try LingPipe.
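As a rough illustration of the rule-based approach, here is a minimal Python sketch. It is not the actual plugin code and does not use the Nutch API. The topic terms, weights, the `PageRecord` class, and the `topic.score` metadata key are all made up for the example. It follows the pattern described in this thread: compute a topic score when the page is parsed, stash it in custom metadata, then read it back when scoring.

```python
import re
from dataclasses import dataclass, field

# Hypothetical topic terms and weights, chosen only for illustration.
TOPIC_WEIGHTS = {"crawler": 3.0, "nutch": 2.0, "index": 1.0}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def topic_score(text, weights=TOPIC_WEIGHTS):
    """Return the summed weight of topic terms divided by the token count."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(weights.get(token, 0.0) for token in tokens)
    return hits / len(tokens)


@dataclass
class PageRecord:
    """Stand-in for a crawl record with custom metadata (not Nutch's CrawlDatum)."""
    url: str
    metadata: dict = field(default_factory=dict)


def analyze_at_parse_time(record, page_text):
    """Run the content analysis during parsing and stash the result as a string."""
    record.metadata["topic.score"] = str(topic_score(page_text))


def outlink_score(record, base_score=1.0):
    """Later, at scoring time, boost the base score by the stashed topic score."""
    topic = float(record.metadata.get("topic.score", "0.0"))
    return base_score * (1.0 + topic)


if __name__ == "__main__":
    on_topic = PageRecord("http://example.com/a")
    off_topic = PageRecord("http://example.com/b")
    analyze_at_parse_time(on_topic, "Nutch crawler tutorial")
    analyze_at_parse_time(off_topic, "Recipes for apple pie")
    print(outlink_score(on_topic), outlink_score(off_topic))
```

Dividing by the token count keeps long pages from winning just by being long. Whether that normalization suits a given topic is a judgment call; real rules would likely also look at anchor text and URL patterns.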
good luck
yanky
http://yanky80.blogspot.com/

2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>

>> I'd like to make something like describe in this thread in focusing the
>> crawling:
>>
>> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
>>
>> First thing :
>>
>> scoring the URL using the hypertext label (href) for focusing on some
>> URL's based on content.
>>
>> It looks like the inlinkDB does not keep the text of URL...so I can access
>> them in the scoring plugin
>> does it mean I'd have to develop this from scratch.
>> Any advice... a feature for Nutch 2.0 ?
>>
>> Second thing for another project :
>>
>> scoring the URL based on the content of the page.
>>
>> It looks like one can not access to the page content... in the scoring
>> plugin.
>
> For this (and probably your preceding question) the way we did it is to do
> the page content analysis at the same time as page parsing, and put the
> result into the CrawlDatum using custom meta-data.
>
> Then we use the result that we stashed in the meta-data later on, when
> doing scoring.
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378