Carl Cerecke wrote:
Carl Cerecke wrote:
Andrzej Bialecki wrote:
Carl Cerecke wrote:

I've given this a crack and it mostly seems to work, except I'm not sure how to get the score back into the crawldb. After reading the Javadoc, I figured that passScoreAfterParsing() was the method I needed to implement; all the others are simple one-liners for this case. Unfortunately, passScoreAfterParsing() is alone in not having a CrawlDatum argument, so I can't call datum.setScore(). I did notice that OPICScoringFilter does this in passScoreAfterParsing(): parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...); I tried that in my own scoring filter, but I still just get the zero from datum.setScore(0.0f) in initialScore().


Nutch.SCORE_KEY is only used to pass the score value to outlinks.



Couple of questions then:
1. Does it make sense to put the relevancy scoring code into passScoreAfterParsing()?
2. If so, how do I get the score into the crawldb?

I'm a bit vague on how all these bits connect together under the hood at the moment.....

Spent all day on this, but no luck. I'm sure I'm missing something obvious. Glad for any pointers in the right direction.

The somewhat awkward API for ScoringFilter comes from the fact that different data is available at different steps, and similarly different output data is updated at different steps. When passScoreAfterParsing executes we don't update the db.

The only way to update the original db entry is a bit indirect: first, you need to create an "adjust" value (using CrawlDatum.STATUS_LINKED) in distributeScoreToOutlinks, and then detect this "adjust" value in updateDbScore among the other inlinks, and update the CrawlDatum datum with the new score value.
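
A rough, self-contained sketch of that two-step flow may help to show how the pieces connect. It does not use the Nutch classes: Datum is a hypothetical stand-in for CrawlDatum, the isAdjust flag is an assumed way of marking the "adjust" value so it can be recognised later, and the method signatures are simplified rather than the real ScoringFilter ones.

```java
import java.util.ArrayList;
import java.util.List;

public class AdjustScoreSketch {

    /** Hypothetical stand-in for a crawldb entry; NOT Nutch's CrawlDatum. */
    public static final class Datum {
        // Placeholder value; the real constant is CrawlDatum.STATUS_LINKED.
        public static final byte STATUS_LINKED = 1;
        public byte status;
        public float score;
        // Assumed marker so the adjust value can be told apart from real inlinks.
        public boolean isAdjust;
    }

    /** Step 1 (parse side): wrap the page's own new score in an "adjust" datum. */
    public static Datum distributeScoreToOutlinks(float relevancy) {
        Datum adjust = new Datum();
        adjust.status = Datum.STATUS_LINKED;
        adjust.score = relevancy;
        adjust.isAdjust = true;
        return adjust;
    }

    /** Step 2 (updatedb side): find the adjust value among the inlinks and apply it. */
    public static void updateDbScore(Datum datum, List<Datum> inlinked) {
        for (Datum in : inlinked) {
            if (in.isAdjust) {
                datum.score = in.score;
            }
            // Ordinary inlinks are ignored in this sketch; a real filter
            // would decide how (or whether) to combine their scores.
        }
    }

    public static void main(String[] args) {
        Datum dbEntry = new Datum();
        dbEntry.score = 0.0f; // as set by initialScore() in the original question

        List<Datum> inlinked = new ArrayList<>();
        Datum ordinary = new Datum();
        ordinary.status = Datum.STATUS_LINKED;
        ordinary.score = 0.3f;
        inlinked.add(ordinary);
        inlinked.add(distributeScoreToOutlinks(0.75f));

        updateDbScore(dbEntry, inlinked);
        System.out.println("score after update: " + dbEntry.score);
    }
}
```

In a real filter the relevancy value would presumably be computed from the parsed content, and the marker would have to travel with the adjust CrawlDatum in some form; how best to do that is left open here.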

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
