I'd like to do something like what is described in this thread on focused crawling:
http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl

First thing: scoring URLs using the hypertext label of the link (href), in order to focus on certain URLs based on content. It looks like the inlinkDB does not keep the text of the URLs, so I cannot access it in the scoring plugin. Does that mean I'd have to develop this from scratch? Any advice? Could this be a feature for Nutch 2.0?

Second thing, for another project: scoring URLs based on the content of the page. It looks like one cannot access the page content in the scoring plugin.

-RB-