[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ]

Doug Cutting commented on NUTCH-267:
------------------------------------

re: it's as if we didn't want it to be re-crawled if we can't find any inlinks 
to it

We prioritize crawling based on the number of pages we've crawled that link to 
it since we've last crawled it.  Assuming it had links to it that caused it to 
be crawled the first time, and that some of those will also be re-crawled, then 
its score will again increase.  But if no one links to it anymore, it will 
languish, and not be crawled again unless there are no higher-scoring pages.  
That sounds right to me, and I think it's what's suggested in the OPIC paper 
(if I skimmed it correctly).

Perhaps it should not be reset to zero, but one, since that's where pages start 
out.
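
To make the behaviour described above concrete, here is a toy sketch. It is 
not Nutch's actual OPIC code: the class name, method names, the contribution 
values and the resetValue parameter are all hypothetical, and it only assumes 
what is stated above (pages start at one, gain score when pages linking to 
them are crawled, and are reset when they themselves are fetched).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; names and values are hypothetical, not Nutch's API.
public class RecrawlPriorityDemo {
    static final double INITIAL_SCORE = 1.0; // "that's where pages start out"

    private final Map<String, Double> scores = new HashMap<>();
    private final double resetValue;         // 0.0 today, 1.0 as suggested above

    RecrawlPriorityDemo(double resetValue) {
        this.resetValue = resetValue;
    }

    double score(String url) {
        return scores.getOrDefault(url, INITIAL_SCORE);
    }

    // A page linking to `url` was crawled and passed on some score.
    void inlinkCrawled(String url, double contribution) {
        scores.put(url, score(url) + contribution);
    }

    // `url` itself was fetched, so its accumulated score is reset.
    void fetched(String url) {
        scores.put(url, resetValue);
    }

    public static void main(String[] args) {
        RecrawlPriorityDemo db = new RecrawlPriorityDemo(0.0);
        db.inlinkCrawled("http://example.com/a", 0.5);
        System.out.println("before fetch: " + db.score("http://example.com/a"));
        db.fetched("http://example.com/a");
        System.out.println("after fetch:  " + db.score("http://example.com/a"));
        // With no further inlinks crawled, the page stays at the reset value
        // and is only re-crawled when nothing scores higher.
    }
}
```

Constructing it with 1.0 instead of 0.0 models the alternative of resetting 
to one.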

re: why use "sqrt(opic) * docSimilarity" instead of "log(opic * docSimilarity)"

Wrapping log() around things changes the score value but not the ranking.  So 
the question is really, why use sqrt(opic)*docSimilarity and not just 
opic*docSimilarity?  The answer is simply that I tried a few queries and sqrt 
seemed to be required for OPIC to not overly dominate scoring.  It was a "seat 
of the pants" calculation, trying to balance the strength of anchor matches, 
opic scoring and title, url and body matching, etc.  One can disable this by 
changing the score power parameter.
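
A small sketch of the point about log() versus sqrt(). The (opic, 
docSimilarity) values are invented for illustration, and the class and method 
names are hypothetical rather than anything in Nutch. It shows that 
log(opic*docSimilarity) ranks documents exactly as opic*docSimilarity does, 
while a score power of 0.5 can change the order.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustration only; the data and names are made up.
public class ScorePowerDemo {

    // Per-document score: opic^scorePower * docSimilarity.
    static double[] scores(double[] opic, double[] sim, double scorePower) {
        double[] out = new double[opic.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.pow(opic[i], scorePower) * sim[i];
        }
        return out;
    }

    // Same scores wrapped in log(); monotonic, so the order cannot change.
    static double[] logScores(double[] opic, double[] sim) {
        double[] out = new double[opic.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.log(opic[i] * sim[i]);
        }
        return out;
    }

    // Document indices ordered from highest score to lowest.
    static Integer[] ranking(double[] s) {
        Integer[] order = new Integer[s.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -s[i]));
        return order;
    }

    public static void main(String[] args) {
        double[] opic = {100.0, 4.0, 1.0}; // hypothetical OPIC scores
        double[] sim  = {0.1, 0.9, 0.8};   // hypothetical query similarities

        System.out.println("opic*sim:       "
            + Arrays.toString(ranking(scores(opic, sim, 1.0))));
        System.out.println("log(opic*sim):  "
            + Arrays.toString(ranking(logScores(opic, sim))));
        System.out.println("sqrt(opic)*sim: "
            + Arrays.toString(ranking(scores(opic, sim, 0.5))));
    }
}
```

With these numbers the high-OPIC but poorly matching document ranks first 
under a power of 1.0 and under log(), and drops to second under sqrt.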

> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
>          Key: NUTCH-267
>          URL: http://issues.apache.org/jira/browse/NUTCH-267
>      Project: Nutch
>         Type: Bug

>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Chris Schneider
>     Priority: Minor

>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the # of inbound links:
>     if (boostByLinkCount)
>       res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some effect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?
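
For comparison, a side-by-side sketch of the two boost calculations the 
description contrasts. The first method wraps the quoted snippet as-is; the 
second assumes "scorePower of the page score" means pageScore raised to 
scorePower. The class and method names are hypothetical and this is not the 
indexer's actual code.

```java
// Sketch only; class and method names are hypothetical.
public class BoostComparison {

    // Old behaviour (Nutch 0.7): scale the boost by log(e + inbound link count).
    static float linkCountBoost(float res, int linkCount, boolean boostByLinkCount) {
        if (boostByLinkCount)
            res *= (float) Math.log(Math.E + linkCount);
        return res;
    }

    // Behaviour described for 0.8-dev: the page score raised to scorePower
    // (0.5 gives the square root). The linkdb is not consulted at all.
    static float scorePowerBoost(float pageScore, float scorePower) {
        return (float) Math.pow(pageScore, scorePower);
    }

    public static void main(String[] args) {
        for (int links : new int[] {0, 10, 1000}) {
            System.out.println(links + " inlinks -> old boost "
                + linkCountBoost(1.0f, links, true));
        }
        for (float score : new float[] {1.0f, 4.0f, 100.0f}) {
            System.out.println("score " + score + " -> new boost "
                + scorePowerBoost(score, 0.5f));
        }
    }
}
```

Note that with zero inbound links the old factor is log(e) = 1, so the boost 
was left unchanged.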

