Re: Creating a new scoring filter.

Nicolás Lichtmaier Tue, 27 Feb 2007 07:42:12 -0800

Yeah, but there I don't have the parse data for those new pages. What I
would like to do is override "passScoreAfterParsing()" and not pass
anything: just analyze the parsed data and decide a score. The problem
is that that function doesn't get passed the CrawlDatum... it seems I'll
need to modify Nutch itself.... =(

Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I think are a specific type of document. I don't care about links (I'm using -noAdditions to prevent adding links to crawldb, I've backported that to 0.8.x and it's waiting for somebody to commit it =) https://issues.apache.org/jira/browse/NUTCH-438 ).

I just want to replace the scoring algorithm with one which tests whether that URL really is that specific type of document. I want to use the parse data of a document to calculate its relevance.

Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text and
put the relevant score information in parse data's metadata.
2) In distributeScoreToOutlink() ignore the outlinks (just give them
initialScore()), but check your parse data and return an adjust datum
with the status STATUS_LINKED and the score extracted from parse data.
This adjust datum will update the score of the original datum in
updatedb.

Does this work for you?

It doesn't seem a good way to do it. What if there are no outlinks? This method won't be called at all. And anyway, it would be called once per outlink, which would multiply the work.
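
To make the objection concrete, here is a minimal sketch in plain Java. It does not use the Nutch API: RelevanceScorer, process(), the "relevance" metadata key and the "invoice" keyword test are all hypothetical stand-ins, and it assumes the caller invokes the outlink hook once per outlink, as described above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoringFlowSketch {

    /** Hypothetical stand-in for a scoring plugin; NOT the Nutch ScoringFilter interface. */
    static class RelevanceScorer {
        int outlinkCalls = 0; // counts how often the per-outlink hook runs

        // Step 1: analyze the parse text and store a score in the parse metadata.
        void passScoreAfterParsing(String parseText, Map<String, String> parseMeta) {
            // Hypothetical relevance test: a simple keyword check.
            float score = parseText.contains("invoice") ? 1.0f : 0.0f;
            parseMeta.put("relevance", Float.toString(score));
        }

        // Step 2: per-outlink hook; reads the stored score back so it can be
        // applied to the source page. The outlink itself is ignored.
        float distributeScoreToOutlink(String toUrl, Map<String, String> parseMeta) {
            outlinkCalls++;
            return Float.parseFloat(parseMeta.get("relevance"));
        }
    }

    /**
     * Simulated caller (an assumption about the control flow, not Nutch code).
     * Returns the adjusted score for the page, or null if the per-outlink hook
     * never ran and so no adjustment was produced.
     */
    static Float process(RelevanceScorer scorer, String parseText, List<String> outlinks) {
        Map<String, String> parseMeta = new HashMap<>();
        scorer.passScoreAfterParsing(parseText, parseMeta);
        Float adjusted = null;
        for (String outlink : outlinks) {
            adjusted = scorer.distributeScoreToOutlink(outlink, parseMeta);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        RelevanceScorer scorer = new RelevanceScorer();

        // A page with no outlinks: the hook never runs, so the score is lost.
        Float none = process(scorer, "an invoice page", Collections.<String>emptyList());
        System.out.println("no outlinks: adjusted=" + none + ", calls=" + scorer.outlinkCalls);

        // A page with three outlinks: the same score is recomputed three times.
        Float three = process(scorer, "an invoice page", Arrays.asList("a", "b", "c"));
        System.out.println("three outlinks: adjusted=" + three + ", calls=" + scorer.outlinkCalls);
    }
}
```

In this sketch a page without outlinks never gets an adjusted score, and a page with N outlinks repeats the same lookup N times, which are the two problems I mean.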


Thanks!
