Re: Topical/focus URL scoring

Raymond Balmès Fri, 15 May 2009 08:37:37 -0700

Great,

Now I understand the approach. I thought scoring was only done before
fetching, for some reason.


You guys on the Nutch dev group did a great job really.



-Ray-
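
For anyone following the thread: a minimal sketch of the kind of heuristic,
content-based scoring discussed in the quoted messages below. It is plain
Java and does not use the Nutch API. The class name TopicHeuristic, the term
list, and the weights are all hypothetical; in a real plugin the text would
presumably be the string returned by Parse.getText().

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: scores plain page text against
// a hand-written list of weighted topic terms (keys must be lowercase).
class TopicHeuristic {
    private final Map<String, Double> termWeights;

    TopicHeuristic(Map<String, Double> termWeights) {
        this.termWeights = termWeights;
    }

    // Returns the sum of the weights of matching tokens divided by the
    // total number of tokens. Null or blank text scores 0.
    double score(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0.0;
        }
        // Split on anything that is not a letter or a digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        double total = 0.0;
        int counted = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // leading separator produces one empty token
            }
            counted++;
            Double weight = termWeights.get(token);
            if (weight != null) {
                total += weight;
            }
        }
        return counted == 0 ? 0.0 : total / counted;
    }

    public static void main(String[] args) {
        // Assumed example topic: pages about web crawling.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("crawler", 2.0);
        weights.put("nutch", 3.0);
        TopicHeuristic heuristic = new TopicHeuristic(weights);
        System.out.println(heuristic.score("Nutch is an open source web crawler"));
        System.out.println(heuristic.score("A recipe for apple pie"));
    }
}
```

Dividing by the token count keeps long pages from winning just by being
long; whether that is the right normalization depends on the topic.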



2009/5/15 yanky young <yanky.yo...@gmail.com>

> Hi:
>
> In the scoring plugin, you can get document content. There is one
> interface you can implement, ScoringFilter:
> http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html
> You can also just extend OPICScoringFilter. This interface has two
> important methods:
>
> void passScoreAfterParsing(UTF8 url, Content content, Parse parse)
> http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreAfterParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.Parse%29
>
> void passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content)
> http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreBeforeParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.protocol.Content%29
>
> Parameter types:
> UTF8: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/UTF8.html
> Content: http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/protocol/Content.html
> Parse: http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/parse/Parse.html
> CrawlDatum: http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/crawl/CrawlDatum.html
>
> A scoring filter also has other methods for other purposes. The gist is that
> scoring in Nutch has many stages, e.g. inject, generate, crawl, updatedb,
> etc., and each stage has a corresponding scoring method. You can see that
> from the method names in the ScoringFilter interface.
> With regard to the two methods you mentioned, the difference is that one
> passes scoring information before parsing and the other passes it after
> parsing. So in the passScoreAfterParsing method, you can get the page text
> from the parse object with Parse.getText(). Then you can do whatever
> analysis you want.
>
> I recommend you read the source code of OPICScoringFilter. It shows how to
> store your own scoring information.
>
> good luck
>
> yanky
>
> 2009/5/15 Raymond Balmès <raymond.bal...@gmail.com>
>
> > Thx, I have my own heuristic quite clear... however to implement this you
> > need to be able to 'read' document content and analyze it. I'm (was?)
> > under the impression that in the scoring plugin you can NOT access the
> > document content.
> >
> > Am I wrong?
> >
> > Also I don't fully understand why there is one method beforeParsing and
> > another afterParsing, and what they are really for.
> > Is there any documentation that I should read first?
> >
> > -Ray-
> > 2009/5/14 yanky young <yanky.yo...@gmail.com>
> >
> > > Hi:
> > >
> > > I did focused crawling with Nutch a few months ago. What I did was to
> > > override some methods of the scoring-opic plugin before and after
> > > parsing, just as Krugler said. I customized the scoring metadata, and I
> > > even managed to integrate a text classifier, such as a Bayesian
> > > classifier, to automatically classify web pages. But maybe because of
> > > the small size of the training dataset, I didn't get good
> > > precision/recall. In the end, I just wrote a customized scoring
> > > algorithm based on heuristic rules for my topic. It works quite well.
> > > You can use a classifier or a customized algorithm for topic-based
> > > crawling. It depends on how much training data you have and what topic
> > > you are crawling. For a text classifier, you can try LingPipe.
> > >
> > > good luck
> > >
> > > yanky
> > >
> > > http://yanky80.blogspot.com/
> > >
> > > 2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>
> > >
> > > >> I'd like to do something like what is described in this thread to
> > > >> focus the crawl:
> > > >>
> > > >> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
> > > >>
> > > >> First thing:
> > > >>
> > > >> scoring the URL using the hypertext label (href), to focus on some
> > > >> URLs based on content.
> > > >>
> > > >> It looks like the inlinkDB does not keep the text of the URL... so I
> > > >> can't access it in the scoring plugin.
> > > >> Does it mean I'd have to develop this from scratch?
> > > >> Any advice... a feature for Nutch 2.0?
> > > >>
> > > >> Second thing, for another project:
> > > >>
> > > >> scoring the URL based on the content of the page.
> > > >>
> > > >> It looks like one cannot access the page content... in the scoring
> > > >> plugin.
> > > >>
> > > >
> > > > For this (and probably your preceding question) the way we did it is
> > > > to do the page content analysis at the same time as page parsing, and
> > > > put the result into the CrawlDatum using custom meta-data.
> > > >
> > > > Then we use the result that we stashed in the meta-data later on,
> > > > when doing scoring.
> > > >
> > > > -- Ken
> > > > --
> > > > Ken Krugler
> > > > +1 530-210-6378
> > > >
> > >
> >
>
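
The two-step approach Ken describes at the bottom of the thread (analyze the
content at parse time, stash the result as metadata, read it back when
scoring) can be sketched roughly as follows. This is plain Java, not the
Nutch API: a Map stands in for the metadata carried on the CrawlDatum, and
the class name, the key "_topic_score_", and the rule for combining the two
scores are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the two-step approach. The Map plays the role
// of the per-URL metadata; none of this is the actual Nutch API.
class StashedTopicScore {
    // Assumed metadata key; any name unique to your plugin would do.
    static final String TOPIC_KEY = "_topic_score_";

    // Step 1 (parse time): record the content analysis result as metadata.
    static void stash(Map<String, String> metadata, double topicScore) {
        metadata.put(TOPIC_KEY, Double.toString(topicScore));
    }

    // Step 2 (scoring time): boost the existing score by the stashed value.
    // Missing or unparsable metadata leaves the base score unchanged.
    static float adjust(Map<String, String> metadata, float baseScore) {
        String raw = metadata.get(TOPIC_KEY);
        if (raw == null) {
            return baseScore;
        }
        try {
            double topic = Double.parseDouble(raw);
            // Assumed combination rule: multiplicative boost.
            return (float) (baseScore * (1.0 + topic));
        } catch (NumberFormatException e) {
            return baseScore;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        stash(metadata, 0.5);
        System.out.println(adjust(metadata, 1.0f));
        System.out.println(adjust(new HashMap<>(), 1.0f));
    }
}
```

Storing the value as a string keeps the sketch independent of any particular
metadata type; OPICScoringFilter, as yanky suggests, is the place to look
for how scoring information is actually stored.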
