Great, now I understand the approach. For some reason I thought scoring was only done before fetching.
You guys on the Nutch dev group really did a great job.

-Ray-

2009/5/15 yanky young <yanky.yo...@gmail.com>

> Hi:
>
> In the scoring plugin, you can get the document content. There is one
> interface you can implement, ScoringFilter:
> http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html
>
> You can also just extend OPICScoringFilter. This interface has two
> important methods:
>
>   void passScoreAfterParsing(UTF8 url, Content content, Parse parse)
>   http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreAfterParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.protocol.Content,%20org.apache.nutch.parse.Parse%29
>
>   void passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content)
>   http://dejafeed.com/nutch-8/docs/api/org/apache/nutch/scoring/ScoringFilter.html#passScoreBeforeParsing%28org.apache.hadoop.io.UTF8,%20org.apache.nutch.crawl.CrawlDatum,%20org.apache.nutch.protocol.Content%29
>
> A scoring filter also has other methods for other purposes. The gist is
> that scoring in Nutch has many stages, e.g. inject, generate, crawl,
> updatedb, etc., and each stage has a corresponding scoring method. You
> can see that from the method names in the ScoringFilter interface.
>
> With regard to the two methods you mentioned, the difference is that one
> has scoring information before parsing and the other has scoring
> information after parsing. So in the passScoreAfterParsing method, you
> can get the page's text from the parse object with Parse.getText(), and
> then do whatever analysis you want.
>
> I recommend you read the source code of OPICScoringFilter. It shows how
> to store your own scoring information.
>
> good luck
>
> yanky
>
> 2009/5/15 Raymond Balmès <raymond.bal...@gmail.com>
>
> > Thx, I have my own heuristic quite clear... however, to implement this
> > you need to be able to 'read' the document content and analyze it. I'm
> > (was?) under the impression that in the scoring plugin you can NOT
> > access the document content.
> >
> > Am I wrong?
> >
> > Also, I don't fully understand why there is one method beforeParsing
> > and another afterParsing, and what they are really for.
> > Is there any documentation that I should read first?
> >
> > -Ray-
> >
> > 2009/5/14 yanky young <yanky.yo...@gmail.com>
> >
> > > Hi:
> > >
> > > I did focused crawling with Nutch a few months ago. What I did was
> > > override some methods of the scoring-opic plugin before and after
> > > parsing, just as Krugler said. I customized the scoring metadata,
> > > and I even managed to integrate a text classifier, such as a
> > > Bayesian classifier, to automatically classify web pages. But, maybe
> > > because of the small size of the training dataset, I didn't get good
> > > precision/recall. In the end, I just wrote a customized scoring
> > > algorithm based on heuristic rules for my topic. It works quite
> > > well. You can use a classifier or a customized algorithm for
> > > topic-based crawling. It depends on how much training data you have
> > > and what topic you are crawling. For a text classifier, you can try
> > > LingPipe.
> > >
> > > good luck
> > >
> > > yanky
> > >
> > > http://yanky80.blogspot.com/
> > >
> > > 2009/5/14 Ken Krugler <kkrugler_li...@transpac.com>
> > >
> > > >> I'd like to do something like what is described in this thread
> > > >> on focused crawling:
> > > >>
> > > >> http://www.lucidimagination.com/search/document/18ff10be2221173e/nutch_topical_focused_crawl
> > > >>
> > > >> First thing:
> > > >>
> > > >> scoring the URL using the hypertext label (href) for focusing on
> > > >> some URLs based on content.
> > > >>
> > > >> It looks like the inlinkDB does not keep the text of the URL so
> > > >> that I can access it in the scoring plugin.
> > > >> Does it mean I'd have to develop this from scratch?
> > > >> Any advice... a feature for Nutch 2.0?
> > > >>
> > > >> Second thing, for another project:
> > > >>
> > > >> scoring the URL based on the content of the page.
> > > >>
> > > >> It looks like one cannot access the page content... in the
> > > >> scoring plugin.
> > > >
> > > > For this (and probably your preceding question) the way we did it
> > > > is to do the page content analysis at the same time as page
> > > > parsing, and put the result into the CrawlDatum using custom
> > > > meta-data.
> > > >
> > > > Then we use the result that we stashed in the meta-data later on,
> > > > when doing scoring.
> > > >
> > > > -- Ken
> > > > --
> > > > Ken Krugler
> > > > +1 530-210-6378
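
Editor's sketch of the pattern discussed in this thread (analyze the page text at parse time, stash the result in metadata, read it back when scoring). This is plain Java and deliberately does not use the Nutch API: the class TopicScoreSketch, the methods topicScore, stashAfterParsing and adjustScore, and the metadata key "topic.score" are all hypothetical names chosen for illustration. A plain Map stands in for the CrawlDatum metadata, and a simple keyword-fraction heuristic stands in for whatever topic rules or classifier you would actually use. In a real plugin the first step would presumably live in something like passScoreAfterParsing, working on the text returned by Parse.getText().

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in class; NOT part of Nutch.
final class TopicScoreSketch {

    // Hypothetical metadata key under which the parse-time result is stashed.
    static final String TOPIC_SCORE_KEY = "topic.score";

    // Heuristic (an assumption for illustration): the fraction of topic
    // terms that occur at least once in the parsed text, case-insensitively.
    static float topicScore(String parsedText, List<String> topicTerms) {
        if (parsedText == null || topicTerms.isEmpty()) {
            return 0.0f;
        }
        String text = parsedText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String term : topicTerms) {
            if (text.contains(term.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (float) hits / topicTerms.size();
    }

    // Parse-time step: analyze the text and stash the result in metadata.
    // The Map is a stand-in for the per-URL metadata carried by the crawler.
    static void stashAfterParsing(String parsedText, List<String> topicTerms,
                                  Map<String, String> metadata) {
        metadata.put(TOPIC_SCORE_KEY,
                     Float.toString(topicScore(parsedText, topicTerms)));
    }

    // Later scoring step: read the stashed value back and scale the base
    // score by it. If nothing was stashed, the base score is left unchanged.
    static float adjustScore(float baseScore, Map<String, String> metadata) {
        String stored = metadata.get(TOPIC_SCORE_KEY);
        if (stored == null) {
            return baseScore;
        }
        return baseScore * Float.parseFloat(stored);
    }
}
```

Keeping the analysis and the scoring in two separate steps mirrors Ken's description: the expensive content analysis runs once, while the text is available, and the later scoring stage only needs the small stashed value. How the metadata actually travels between stages in Nutch is best checked against the OPICScoringFilter source, as yanky suggests.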