Yeah, but there I don't have the parse data for those new pages. What I
would like to do is override "passScoreAfterParsing()" and not pass
anything: just analyze the parsed data and decide a score. The problem
is that the function isn't passed the CrawlDatum, so it seems I'll
need to modify Nutch itself... =(
Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I think are a specific type of document. I don't care about links: I'm using -noAdditions to prevent links from being added to the crawldb. (I've backported that to 0.8.x and it's waiting for somebody to commit it =) https://issues.apache.org/jira/browse/NUTCH-438 )

I just want to replace the scoring algorithm with one that tests whether each URL really is that specific type of document. I want to use the parse data of a document to calculate its relevance.

Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text and
put the relevant score information in the parse data's metadata.
2) In distributeScoreToOutlink(), ignore the outlinks (just give them
initialScore()), but check your parse data and return an adjust datum
with the status STATUS_LINKED and the score extracted from the parse
data. This adjust datum will update the score of the original datum
in updatedb.
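
The data flow in the two steps above can be sketched roughly as follows. This is a self-contained illustration in plain Java, not the Nutch ScoringFilter API: the class ParseScoreSketch, the Adjust holder, the metadata key "x-relevance-score", and the keyword-based relevance() test are all hypothetical names made up for the example, and a plain Map stands in for the parse metadata.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParseScoreSketch {
    // Hypothetical metadata key; not a constant defined by Nutch.
    static final String SCORE_KEY = "x-relevance-score";

    /** Hypothetical stand-in for the "adjust datum": a status plus a score. */
    static final class Adjust {
        final String status;
        final float score;

        Adjust(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /** Assumed relevance test: fraction of the expected keywords found in the parse text. */
    static float relevance(String parseText, String[] keywords) {
        if (keywords.length == 0) {
            return 0f;
        }
        String text = parseText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (float) hits / keywords.length;
    }

    /** Step 1: analyze the parse text and record the score in the parse metadata. */
    static void recordScore(String parseText, String[] keywords, Map<String, String> parseMeta) {
        parseMeta.put(SCORE_KEY, Float.toString(relevance(parseText, keywords)));
    }

    /** Step 2: ignore the outlinks, read the stored score back, and wrap it in an adjust value. */
    static Adjust adjustFromParse(Map<String, String> parseMeta) {
        String stored = parseMeta.get(SCORE_KEY);
        float score = (stored == null) ? 0f : Float.parseFloat(stored);
        return new Adjust("STATUS_LINKED", score);
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        recordScore("Invoice total due", new String[] {"invoice", "iban"}, parseMeta);
        Adjust adjust = adjustFromParse(parseMeta);
        System.out.println(adjust.status + " " + adjust.score);
    }
}
```

The score travels through the metadata as a string because, in this sketch, the metadata map is the only thing the two steps share. How the real hooks receive and return these values depends on the Nutch version in use.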

Does this work for you?

That doesn't seem like a good way to do it. What if there are no outlinks? Then this method won't be called at all. And in any case it would be called once per outlink, which would multiply the work.

Thanks!
