Re: Custom outlink scoring (0.8)

Andrzej Bialecki Wed, 06 Sep 2006 07:00:08 -0700

Vinsil wrote:

Hi,


I'd like to implement a custom "topical crawler" using Nutch 0.8. After
digging *a bit* into the source code, I'm blocked when it comes to Hadoop.
Unfortunately, I don't have time now to dive into it.

The main idea would be to score the Outlinks based on their topical
relevance to a given subject so they can be ordered by this "relevance
score" during the next fetchlist generation (using a custom ScoringFilter).
This would lead to a "best-first" fetching strategy with "best" meaning
"more relevant". As the scoring of an outlink would be partly based upon
local textual context around the outlink, the ideal place to compute this
score should (??) be during the parsing of the surrounding page.

A way to do this might be to:
  - compute and add a score metadata to the Outlinks during parsing.
  - retrieve that score in a custom ScoringFilter during fetchlist
generation.

From what I've understood, the first step doesn't seem possible in Nutch 0.8
(??).
What would be the right way to implement such a behaviour?

Is it possible by creating a pair of custom HtmlParseFilter/ScoringFilter?

You can use ScoringFilter.distributeScoreToOutlink to also modify the
target CrawlDatum, e.g. store some metadata. Then, in
ScoringFilter.updateDbScore you can use this metadata to modify the
output datum based on the metadata collected from inlinked datums
(coming from outlinks, and containing your metadata). This output datum
is then stored in CrawlDB, so you can use its metadata in the next
round, via ScoringFilter.generatorSortValue.
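
To make the data flow concrete, here is a minimal standalone sketch of the
three steps. It does not use the real Nutch classes: Datum is a hypothetical
stand-in for CrawlDatum, the method parameters are simplified and are not the
actual ScoringFilter signatures, and the "topic.relevance" key and the
"keep the maximum" merge rule are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only: this mirrors the data flow, not the Nutch API.
final class TopicalScoringSketch {

    /** Simplified stand-in for CrawlDatum: a bag of string metadata. */
    static final class Datum {
        final Map<String, String> meta = new HashMap<>();
    }

    /** Metadata key; the name is made up for this sketch. */
    static final String RELEVANCE_KEY = "topic.relevance";

    /** Step 1 (parse time): store the outlink's relevance on its target datum. */
    static Datum distributeScoreToOutlink(float relevance) {
        Datum target = new Datum();
        target.meta.put(RELEVANCE_KEY, Float.toString(relevance));
        return target;
    }

    /**
     * Step 2 (CrawlDB update): fold the metadata of all inlinked datums into
     * the output datum. Keeping the maximum is an arbitrary choice here.
     */
    static void updateDbScore(Datum output, List<Datum> inlinked) {
        String current = output.meta.get(RELEVANCE_KEY);
        float best = current == null ? 0f : Float.parseFloat(current);
        for (Datum in : inlinked) {
            String value = in.meta.get(RELEVANCE_KEY);
            if (value != null) {
                best = Math.max(best, Float.parseFloat(value));
            }
        }
        output.meta.put(RELEVANCE_KEY, Float.toString(best));
    }

    /** Step 3 (generate): weight the default sort value by stored relevance. */
    static float generatorSortValue(Datum datum, float initSort) {
        String value = datum.meta.get(RELEVANCE_KEY);
        return value == null ? initSort : initSort * Float.parseFloat(value);
    }

    public static void main(String[] args) {
        // Two pages link to the same URL with different local relevance.
        Datum fromPageA = distributeScoreToOutlink(0.2f);
        Datum fromPageB = distributeScoreToOutlink(0.9f);

        Datum stored = new Datum();
        updateDbScore(stored, Arrays.asList(fromPageA, fromPageB));

        System.out.println(generatorSortValue(stored, 1.0f));
    }
}
```

In a real plugin the same three steps would live in one ScoringFilter
implementation, with the relevance value computed from the text around each
outlink rather than passed in directly.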


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
