Thanks. My own heuristic is quite clear to me... however, to implement it you
need to be able to 'read' the document content and analyze it. I am (was?) under
the impression that in the scoring plugin you can NOT access the document
content.

Am I wrong?

Also, I don't fully understand why there is one method beforeParsing and another
afterParsing, and what they are really for.
Is there any documentation that I should read first?

-Ray-
2009/5/14 yanky young <yanky.yo...@gmail.com>

> Hi:
>
> I did focused crawling with Nutch a few months ago. What I did was to
> override some methods of the scoring-opic plugin before and after parsing,
> just as Krugler said. I customized the scoring meta-data, and I even managed
> to integrate a text classifier, such as a Bayesian classifier, to
> automatically classify web pages. But, maybe because of the small size of
> the training dataset, I didn't get good precision/recall. In the end, I just
> wrote a customized scoring algorithm based on heuristic rules for my topic.
> It works quite well. You can use a classifier or a customized algorithm for
> topic-based crawling. It depends on how much training data you have and what
> topic you are crawling. For a text classifier, you can try LingPipe.
>
> good luck
>
> yanky
>
> http://yanky80.blogspot.com/
>
> 2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>
>
> >> I'd like to do something like what is described in this thread on
> >> focusing the crawl:
> >>
> >> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
> >>
> >> First thing:
> >>
> >> Scoring the URL using the hypertext label (href), to focus on some URLs
> >> based on content.
> >>
> >> It looks like the inlinkDB does not keep the text of the URL... so I can't
> >> access it in the scoring plugin.
> >> Does that mean I'd have to develop this from scratch?
> >> Any advice... a feature for Nutch 2.0?
> >>
> >> Second thing, for another project:
> >>
> >> Scoring the URL based on the content of the page.
> >>
> >> It looks like one cannot access the page content... in the scoring
> >> plugin.
> >>
> >
> > For this (and probably your preceding question) the way we did it is to
> > do the page content analysis at the same time as page parsing, and put the
> > result into the CrawlDatum using custom meta-data.
> >
> > Then we use the result that we stashed in the meta-data later on, when
> > doing scoring.
> >
> > -- Ken
> > --
> > Ken Krugler
> > +1 530-210-6378
> >
>
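
A rough sketch of the two-step pattern Ken describes in the quoted message
(analyze the content at parse time, stash the result as meta-data, read it
back when scoring). It deliberately uses no Nutch classes: a plain Map stands
in for the CrawlDatum meta-data, and the class name, the meta-data key, the
topic terms and the scoring formula are all made up for illustration. How
this would be wired into a real parse or scoring plugin is not shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Stand-alone sketch; a plain Map stands in for the CrawlDatum meta-data. */
class TopicScoreSketch {
    // Hypothetical meta-data key, not a Nutch constant.
    static final String TOPIC_KEY = "x-topic-relevance";

    // Hypothetical topic terms used by the heuristic.
    static final Set<String> TOPIC_TERMS =
            new HashSet<>(Arrays.asList("crawler", "nutch", "lucene"));

    /** Parse-time step: the page text is available, so analyze it and stash the result. */
    static void analyzeAtParseTime(String pageText, Map<String, String> meta) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("\\W+");
        int hits = 0;
        int total = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            total++;
            if (TOPIC_TERMS.contains(token)) {
                hits++;
            }
        }
        // Fraction of tokens that are topic terms, stored as a string value.
        double relevance = (total == 0) ? 0.0 : (double) hits / total;
        meta.put(TOPIC_KEY, Double.toString(relevance));
    }

    /** Scoring-time step: no page content here, only the value stashed earlier. */
    static float scoreFromMeta(Map<String, String> meta, float baseScore) {
        String stored = meta.get(TOPIC_KEY);
        if (stored == null) {
            return baseScore; // page was never analyzed; leave the score alone
        }
        // Made-up formula: boost the base score by the stored relevance.
        return baseScore * (1.0f + Float.parseFloat(stored));
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        analyzeAtParseTime("Nutch is a crawler", meta);
        System.out.println(scoreFromMeta(meta, 1.0f));
    }
}
```

The point of the split is that the scoring step never needs the document
content itself, only the small value computed while the content was at hand.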
