Carl Cerecke wrote:
Carl Cerecke wrote:
Andrzej Bialecki wrote:
Carl Cerecke wrote:
I've given this a crack and it mostly seems to work, except I'm not
sure how to get the score back into the crawldb. After reading the
Javadoc, I figured that passScoreAfterParsing() was the method I need
to implement. All others are just simple one-liners for this case.
Unfortunately, passScoreAfterParsing() is the only method without a
CrawlDatum argument, so I can't call datum.setScore(). I did notice
that OPICScoringFilter does this in passScoreAfterParsing():
parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...); I
tried that in my own scoring filter, but I still just get the zero
set by datum.setScore(0.0f) in initialScore().
Nutch.SCORE_KEY is only used to pass the score value to outlinks.
Couple of questions then:
1. Does it make sense to put the relevancy scoring code into
passScoreAfterParsing()?
2. If so, how do I get the score into the crawldb?
I'm a bit vague on how all these bits connect together under the hood
at the moment.....
Spent all day on this, but no luck. I'm sure I'm missing something
obvious. Glad for any pointers in the right direction.
The somewhat awkward API for ScoringFilter comes from the fact that
different data is available at different steps, and similarly different
output data is updated at different steps. When passScoreAfterParsing
executes we don't update the db.
The only way to update the original db entry is a bit indirect:
first, you need to create an "adjust" value (using
CrawlDatum.STATUS_LINKED) in distributeScoreToOutlinks, and then detect
this "adjust" value in updateDbScore among the other inlinks, and update
the CrawlDatum datum with the new score value.
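
A rough sketch of that two-step flow, in case it helps. It deliberately
does not use the Nutch classes: Datum, ADJUST_KEY, the status constants and
the simplified method signatures are all hypothetical stand-ins, and the
real ScoringFilter methods take more arguments. It only models the idea of
marking an "adjust" value at parse time and picking it out again during
the db update.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins that only model the data flow; NOT the Nutch API.
class AdjustScoreSketch {
    static final int STATUS_LINKED = 1;          // stand-in for CrawlDatum.STATUS_LINKED
    static final int STATUS_DB_FETCHED = 2;      // stand-in for an existing db entry
    static final String ADJUST_KEY = "_adjust_"; // hypothetical marker key

    // Simplified stand-in for CrawlDatum: status, score, metadata.
    static final class Datum {
        final int status;
        float score;
        final Map<String, String> meta = new HashMap<>();

        Datum(int status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Parse-time step: the db is not written here, so the new score is
    // packed into a marked "adjust" datum that travels on to the db update.
    static Datum distributeScoreToOutlinks(float relevancy) {
        Datum adjust = new Datum(STATUS_LINKED, relevancy);
        adjust.meta.put(ADJUST_KEY, "true");
        return adjust;
    }

    // Db-update step: pick the marked datum out of the inlinks and copy its
    // score onto the db entry. Ordinary inlinks are ignored in this sketch.
    static void updateDbScore(Datum dbEntry, List<Datum> inlinked) {
        for (Datum in : inlinked) {
            if (in.status == STATUS_LINKED && in.meta.containsKey(ADJUST_KEY)) {
                dbEntry.score = in.score;
            }
        }
    }

    public static void main(String[] args) {
        Datum dbEntry = new Datum(STATUS_DB_FETCHED, 0.0f);
        List<Datum> inlinked = new ArrayList<>();
        inlinked.add(new Datum(STATUS_LINKED, 0.3f));   // ordinary inlink
        inlinked.add(distributeScoreToOutlinks(0.75f)); // the "adjust" value
        updateDbScore(dbEntry, inlinked);
        System.out.println(dbEntry.score);
    }
}
```

The marker in the metadata is just one assumed way to tell the "adjust"
value apart from ordinary inlinks; the sketch also leaves out how the
value is carried from the parse step to the db update.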
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com