I'd like to do something like what is described in this thread on focused
crawling:
http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
First thing:
scoring URLs using the hypertext label (the link text of the href), to focus
on some URLs based on content.
It looks like the inlink DB does not keep the text of the link... so I can't
access it in the scoring plugin.
Does that mean I'd have to develop this from scratch?
Any advice... or is this a feature for Nutch 2.0?
Second thing, for another project:
scoring URLs based on the content of the page.
It looks like one cannot access the page content... in the scoring
plugin.
For this (and probably your preceding question) the way we did it is
to do the page content analysis at the same time as page parsing, and
put the result into the CrawlDatum using custom meta-data.
Then we use the result that we stashed in the meta-data later on,
when doing scoring.
-- Ken
--
Ken Krugler
+1 530-210-6378
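
The two-phase approach described in the reply (analyze at parse time, stash the result, read it back when scoring) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code referred to in the reply: a plain Map stands in for the CrawlDatum meta-data, and the class name, the key "x-topic-relevance", the term-fraction relevance measure, and the score formula are all hypothetical. A real plugin would do the same two steps inside Nutch's own parse and scoring extension points.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class FocusedScoringSketch {

    // Hypothetical meta-data key; any name that does not clash with existing keys would do.
    static final String RELEVANCE_KEY = "x-topic-relevance";

    // Content analysis: fraction of the page's words that are topic terms.
    // This measure is an assumption made for the sketch, not a Nutch feature.
    static float relevance(String pageText, Set<String> topicTerms) {
        int total = 0;
        int hits = 0;
        for (String word : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            total++;
            if (topicTerms.contains(word)) {
                hits++;
            }
        }
        return total == 0 ? 0f : (float) hits / total;
    }

    // Phase 1 (parse time): the page text is available here, so analyze it
    // and stash the result in the meta-data stand-in.
    static void stashRelevance(Map<String, String> meta, String pageText, Set<String> topicTerms) {
        meta.put(RELEVANCE_KEY, Float.toString(relevance(pageText, topicTerms)));
    }

    // Phase 2 (scoring time): the page text is no longer available, only the
    // stashed value. Pages without a stashed value keep their base score.
    static float score(Map<String, String> meta, float baseScore) {
        String stored = meta.get(RELEVANCE_KEY);
        if (stored == null) {
            return baseScore;
        }
        return baseScore * (1f + Float.parseFloat(stored));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        stashRelevance(meta, "nutch crawl web pages", Set.of("nutch", "crawl"));
        System.out.println(meta.get(RELEVANCE_KEY) + " -> " + score(meta, 2.0f));
    }
}
```

The same pattern would apply to the link-text question: whatever is only visible while parsing (page text, outlink labels) is reduced to a small value at that point and carried along as meta-data, so the scoring step never needs the original content.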