Hi:

I did focused crawling with Nutch a few months ago. What I did was override
some methods of the scoring-opic plugin that run before and after parsing,
just as Krugler said. I also customized the scoring meta-data. And I even
managed to integrate a text classifier, such as a Bayesian classifier, to
automatically classify web pages. But, maybe because my training dataset was
small, I didn't get good precision/recall. In the end, I just wrote a
customized scoring algorithm based on heuristic rules for my topic. It works
quite well. You can use a classifier or a customized algorithm for
topic-based crawling. It depends on how much training data you have and what
topic you are crawling. For a text classifier, you can try LingPipe.
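
To give a rough idea of what I mean by heuristic rules, here is a minimal
Java sketch. It is not my actual code and it does not touch the Nutch API;
the class name, the keywords and their weights are all made up for the
example, so treat it only as a starting point.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical rule-based topic scorer; not part of Nutch.
class HeuristicTopicScorer {
    // Assumed topic keywords and weights, chosen only for illustration.
    private final Map<String, Double> weights = new LinkedHashMap<>();

    HeuristicTopicScorer() {
        weights.put("solar", 1.0);
        weights.put("photovoltaic", 2.0);
        weights.put("inverter", 0.5);
    }

    // Sums the weight of every keyword occurrence, then squashes the sum
    // into [0, 1) so that long pages cannot grow the score without bound.
    double score(String text) {
        double raw = 0.0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            raw += weights.getOrDefault(token, 0.0);
        }
        return raw / (1.0 + raw);
    }

    public static void main(String[] args) {
        HeuristicTopicScorer scorer = new HeuristicTopicScorer();
        // An on-topic page should score higher than an off-topic one.
        System.out.println(scorer.score("Photovoltaic panels and solar inverter reviews"));
        System.out.println(scorer.score("Celebrity gossip and football results"));
    }
}
```

With rules like these you need no training data at all, which is why it
suited my case; the price is that the weights have to be tuned by hand for
each topic.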

good luck

yanky

http://yanky80.blogspot.com/

2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>

>> I'd like to do something like what is described in this thread to focus the
>> crawl:
>>
>>
>> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
>>
>>
>> First thing :
>>
>> Scoring URLs using the hypertext label (href), to focus on some URLs
>> based on content.
>>
>> It looks like the inlinkDB does not keep the text of the URL, so I can't
>> access it in the scoring plugin.
>> Does that mean I'd have to develop this from scratch?
>> Any advice? A feature for Nutch 2.0?
>>
>>
>> Second thing for another project :
>>
>> scoring the URL based on the content of the page.
>>
>> It looks like one cannot access the page content in the scoring
>> plugin.
>>
>
> For this (and probably your preceding question) the way we did it is to do
> the page content analysis at the same time as page parsing, and put the
> result into the CrawlDatum using custom meta-data.
>
> Then we use the result that we stashed in the meta-data later on, when
> doing scoring.
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
>
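
The "analyse at parse time, score later" pattern Ken describes can be
sketched roughly as below. This is only an illustration under loud
assumptions: a plain Map stands in for the CrawlDatum meta-data, no Nutch
classes are used, and the key name, the topic terms and both method names
are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-alone sketch; a Map plays the role of the CrawlDatum meta-data.
class MetaDataScoringSketch {
    // Hypothetical meta-data key.
    static final String TOPIC_KEY = "x-topic-score";

    // Parse phase: the page text is available, so analyse it here and
    // stash the result in the meta-data as a string.
    static void analyseAtParseTime(String pageText, Map<String, String> metaData) {
        String lower = pageText.toLowerCase(Locale.ROOT);
        // Placeholder analysis: fraction of assumed topic terms found in the page.
        String[] topicTerms = {"nutch", "crawler", "lucene"};
        int hits = 0;
        for (String term : topicTerms) {
            if (lower.contains(term)) {
                hits++;
            }
        }
        metaData.put(TOPIC_KEY, Double.toString((double) hits / topicTerms.length));
    }

    // Scoring phase: the page content is no longer at hand, so rely only
    // on what was stashed earlier. A missing value counts as off-topic.
    static float scoreLater(Map<String, String> metaData, float baseScore) {
        String stored = metaData.get(TOPIC_KEY);
        double topic = (stored == null) ? 0.0 : Double.parseDouble(stored);
        return (float) (baseScore * topic);
    }

    public static void main(String[] args) {
        Map<String, String> metaData = new HashMap<>();
        analyseAtParseTime("Nutch is an open source web crawler", metaData);
        System.out.println(scoreLater(metaData, 3.0f));
    }
}
```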
