>> Yeah, but there I don't have the parse data for those new pages. What I
>> would like to do is override "passScoreAfterParsing()" and not pass
>> anything: just analyze the parsed data and decide a score. The problem
>> is that that function doesn't get passed the CrawlDatum... it seems I'll
>> need to modify Nutch itself.... =(
> Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I believe are all one specific type
of document. I don't care about links (I'm using -noAdditions to prevent
links from being added to the crawldb; I've backported that to 0.8.x and
it's waiting for somebody to commit it =)
https://issues.apache.org/jira/browse/NUTCH-438 ).

I just want to replace the scoring algorithm with one that tests whether
each URL really is that specific type of document. I want to use the parse
data of a document to calculate its relevance.
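
To make that concrete, here is a rough sketch of the kind of check I have in mind. It is plain Java and deliberately does not touch any Nutch classes: ParseTextScorer, the marker phrases and the "fraction of markers found" rule are all made up for illustration, and the real test for my document type would be more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone scorer; it does not use any Nutch classes.
final class ParseTextScorer {
    // Phrases assumed (for this sketch only) to indicate the document type.
    private final List<String> markers;

    ParseTextScorer(List<String> markers) {
        this.markers = markers;
    }

    // Returns the fraction of marker phrases found in the parsed text, in [0, 1].
    float score(String parseText) {
        if (parseText == null || markers.isEmpty()) {
            return 0.0f;
        }
        String haystack = parseText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String marker : markers) {
            if (haystack.contains(marker.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return (float) hits / markers.size();
    }

    public static void main(String[] args) {
        ParseTextScorer scorer =
            new ParseTextScorer(Arrays.asList("abstract", "references"));
        // Prints the relevance of a text that contains only one of the markers.
        System.out.println(scorer.score("Abstract: a short summary"));
    }
}
```

The point is only that the score depends on nothing but the parsed text of the page itself, so the links never need to be looked at.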

> Anyway, without the details, here is my guess on how you can do it:
> 1) In passScoreAfterParsing(), analyze the content and parse text and
> put the relevant score information in parse data's metadata.
> 2) In distributeScoreToOutlink() ignore the outlinks (just give them
> initialScore()),
> but check your parse data and return an adjust datum with the status
> STATUS_LINKED and score extracted from parse data. This adjust datum
> will update the score of the original datum in updatedb.
>
> Does this work for you?

That doesn't seem like a good way to do it. What if there are no outlinks?
Then this method won't be called at all. And in any case it would be called
once per outlink, which would multiply the work.
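
A toy simulation of what I mean, assuming the hook really is invoked once per outlink as I understand it. Nothing here is Nutch code: PerOutlinkHookDemo, analyze() and process() are hypothetical names that only mimic that calling pattern.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical simulation; none of these names are Nutch classes.
final class PerOutlinkHookDemo {
    private int analyses = 0;

    // Stand-in for the expensive "is this really that type of document?" check.
    private float analyze(String parseText) {
        analyses++;
        return parseText.isEmpty() ? 0.0f : 1.0f;
    }

    // Mimics a framework that calls the scoring hook once for every outlink
    // and returns how many times the page ended up being analyzed.
    int process(String parseText, List<String> outlinks) {
        analyses = 0;
        for (String ignored : outlinks) {
            analyze(parseText);
        }
        return analyses;
    }

    public static void main(String[] args) {
        PerOutlinkHookDemo demo = new PerOutlinkHookDemo();
        List<String> none = Collections.emptyList();
        List<String> three = Arrays.asList("http://a/", "http://b/", "http://c/");
        // A page without outlinks is never analyzed; a page with three is analyzed three times.
        System.out.println(demo.process("some parsed text", none));
        System.out.println(demo.process("some parsed text", three));
    }
}
```

Under that assumption, a page with no outlinks never gets a score from this hook, and a page with many outlinks repeats the same analysis for each one.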

Thanks!


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
