I'd like to do something like what is described in this thread on focused
crawling:

http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl


First thing:

Scoring URLs using the link label (the anchor text of the href), to focus
the crawl on certain URLs based on content.
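
A minimal sketch of what such anchor-text scoring could look like, independent of Nutch. The class name AnchorTextScorer and the topic terms are hypothetical, chosen only for illustration; how the anchor text would reach a scoring plugin is exactly the open question here and is not shown.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Nutch class): scores a link by its anchor text. */
public class AnchorTextScorer {
    private final Set<String> topicTerms = new HashSet<>();

    /** Topic terms are an assumption; they are stored lower-cased. */
    public AnchorTextScorer(String... terms) {
        for (String t : terms) {
            topicTerms.add(t.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the fraction of anchor-text tokens that are topic terms, in [0, 1]. */
    public float score(String anchorText) {
        if (anchorText == null) {
            return 0.0f;
        }
        int hits = 0;
        int total = 0;
        // Split on runs of non-word characters; skip the empty token a leading separator produces.
        for (String tok : anchorText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (tok.isEmpty()) {
                continue;
            }
            total++;
            if (topicTerms.contains(tok)) {
                hits++;
            }
        }
        return total == 0 ? 0.0f : (float) hits / total;
    }
}
```

A link whose label contains mostly topic terms gets a score close to 1, and a link with an unrelated or empty label gets 0, so the value could be used to prioritise which outlinks to fetch first.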

It looks like the inlinkDB does not keep the link text of the URLs, so I
cannot access it in the scoring plugin.
Does that mean I'd have to develop this from scratch?
Any advice? Could this be a feature for Nutch 2.0?


Second thing, for another project:

Scoring URLs based on the content of the page.

It looks like one cannot access the page content in the scoring
plugin.
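
For illustration, a rough sketch of content-based scoring, assuming the extracted page text were available as a plain string. ContentRelevance, its methods, and the weighted term map are hypothetical names, not part of Nutch.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): boosts a page score by topical relevance. */
public class ContentRelevance {

    /**
     * Sums the weights of all topic terms found in the text and divides by the
     * number of tokens. Keys of termWeights are assumed to be lower-case.
     */
    public static float relevance(String pageText, Map<String, Float> termWeights) {
        if (pageText == null) {
            return 0.0f;
        }
        float sum = 0.0f;
        int total = 0;
        for (String tok : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (tok.isEmpty()) {
                continue;
            }
            total++;
            Float w = termWeights.get(tok);
            if (w != null) {
                sum += w;
            }
        }
        return total == 0 ? 0.0f : sum / total;
    }

    /** Multiplies an existing score by (1 + relevance); a relevance of 0 leaves it unchanged. */
    public static float boost(float baseScore, float relevance) {
        return baseScore * (1.0f + relevance);
    }
}
```

Multiplying rather than replacing the score keeps whatever link-based score the page already has, and only raises it for pages whose text matches the topic.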


-RB-
