Re: Topical/focus URL scoring

yanky young Thu, 14 May 2009 19:05:56 -0700

Hi:

In a scoring plugin you can get at the document content. There is one
interface you can implement,
ScoringFilter<http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html>,
or you can simply extend OPICScoringFilter. The interface has two methods
that matter here:


void passScoreAfterParsing(UTF8 url, Content content, Parse parse)
<http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreAfterParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.Parse%29>

void passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content)
<http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreBeforeParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.protocol.Content%29>

A scoring filter also has other methods for other purposes. The gist is that
scoring in Nutch happens in several stages, e.g. inject, generate, crawl,
updatedb, etc., and each stage has a corresponding scoring method. You can
tell which is which from the ScoringFilter method names.
As for the two methods you mentioned, the difference is that one passes
score information along before parsing and the other after parsing. So in
passScoreAfterParsing you can get the page's extracted text from the Parse
object with Parse.getText(), and then do whatever analysis you want.
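
As a rough sketch of the kind of analysis you could run on that text: the
class below scores a string against weighted topic keywords. TopicScorer,
its methods, and the keyword weights are made up for illustration and are not
part of Nutch; it deliberately uses no Nutch classes, so you would call it
yourself with whatever text you obtained from the Parse object.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Nutch class): keyword-based topical score.
final class TopicScorer {
    private final Map<String, Float> weights = new LinkedHashMap<>();

    // Registers a topic keyword with its weight; matching is case-insensitive.
    TopicScorer add(String keyword, float weight) {
        weights.put(keyword.toLowerCase(Locale.ROOT), weight);
        return this;
    }

    // Sum of the weights of matching tokens divided by the token count,
    // so a long page is not favoured merely for being long.
    // Null or empty text scores 0.
    float score(String text) {
        if (text == null) {
            return 0f;
        }
        // Split on anything that is not a letter or a digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        float sum = 0f;
        int count = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // leading delimiter produces an empty token
            }
            count++;
            Float w = weights.get(token);
            if (w != null) {
                sum += w;
            }
        }
        return count == 0 ? 0f : sum / count;
    }
}
```

Normalising by token count is only one possible choice; a heuristic tuned to
your topic may well need something different.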

I recommend reading the source code of OPICScoringFilter; it shows how to
store your own scoring information.
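
To give an idea of what storing your own scoring information can look like,
here is a minimal sketch that carries a float score through string-valued
metadata from one stage to the next. A plain Map stands in for the Nutch
metadata objects, and ScoreMeta and the key name "_topic_score_" are my own
inventions for this example, not anything defined by Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for string-valued metadata.
final class ScoreMeta {
    // Made-up key name, chosen only for this example.
    static final String TOPIC_SCORE_KEY = "_topic_score_";

    // Stores the score as a string under the custom key.
    static void put(Map<String, String> meta, float score) {
        meta.put(TOPIC_SCORE_KEY, Float.toString(score));
    }

    // Reads the score back; falls back to defaultScore when the key is
    // missing or its value cannot be parsed as a float.
    static float get(Map<String, String> meta, float defaultScore) {
        String raw = meta.get(TOPIC_SCORE_KEY);
        if (raw == null) {
            return defaultScore;
        }
        try {
            return Float.parseFloat(raw);
        } catch (NumberFormatException e) {
            return defaultScore;
        }
    }

    // Copies the stored score from one stage's metadata to the next,
    // leaving the target untouched when no score was stored.
    static void pass(Map<String, String> from, Map<String, String> to) {
        String raw = from.get(TOPIC_SCORE_KEY);
        if (raw != null) {
            to.put(TOPIC_SCORE_KEY, raw);
        }
    }
}
```

In a real filter the two maps would be replaced by whatever metadata the
before/after parsing methods give you access to.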

good luck

yanky

2009/5/15 Raymond Balmès <raymond.bal...@gmail.com>

> Thx, I have my own heuristic quite clear... however to implement this you
> need to be able to 'read' document content and analyze it. I'm (was?) under
> the impression that in the scoring plugin you can NOT access the document
> content.
>
> Am I wrong ?
>
> Also I don't fully understand why there is one method beforeParsing and
> another afterParsing, and what they are really for.
> Is there any documentation that I should read first?
>
> -Ray-
> 2009/5/14 yanky young <yanky.yo...@gmail.com>
>
> > Hi:
> >
> > I did focused crawling with Nutch a few months ago. What I did was to
> > override some methods of the scoring-opic plugin before and after
> > parsing, just as Krugler said. I customized the scoring metadata, and I
> > even managed to integrate a text classifier (a Bayesian classifier) to
> > automatically classify web pages. But, maybe because of the small size
> > of the training dataset, I didn't get good precision/recall. In the
> > end, I just wrote a customized scoring algorithm based on heuristic
> > rules from my topic. It works quite well. You can use a classifier or a
> > customized algorithm for topic-based crawling. It depends on how much
> > training data you have and what topic you are crawling. For a text
> > classifier, you can try Lingpipe.
> >
> > good luck
> >
> > yanky
> >
> > http://yanky80.blogspot.com/
> >
> > 2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>
> >
> > >> I'd like to do something like what is described in this thread on
> > >> focusing the crawl:
> > >>
> > >> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
> > >>
> > >> First thing :
> > >>
> > >> Scoring the URL using the hypertext label (href), to focus on some
> > >> URLs based on content.
> > >>
> > >> It looks like the inlinkDB does not keep the text of the URL, so I
> > >> cannot access it in the scoring plugin. Does that mean I'd have to
> > >> develop this from scratch?
> > >> Any advice... a feature for Nutch 2.0 ?
> > >>
> > >>
> > >> Second thing for another project :
> > >>
> > >> scoring the URL based on the content of the page.
> > >>
> > >> It looks like one cannot access the page content in the scoring
> > >> plugin.
> > >>
> > >
> > > For this (and probably your preceding question) the way we did it is
> > > to do the page content analysis at the same time as page parsing, and
> > > put the result into the CrawlDatum using custom meta-data.
> > >
> > > Then we use the result that we stashed in the meta-data later on, when
> > > doing scoring.
> > >
> > > -- Ken
> > > --
> > > Ken Krugler
> > > +1 530-210-6378
> > >
> >
>
