[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379072 ] 

Andrzej Bialecki  commented on NUTCH-267:
-----------------------------------------

Hmm, resetting the score to 0 is also dubious - it's as if we didn't want it to 
be re-crawled if we can't find any inlinks to it... I believe it should be 
reset to the following value:

    newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)

where initialScore is the score we got from previous iterations (or 
injectedScore), sum(distributedScoreM) is what we have distributed to M 
outlinks from that page, and sum(incomingScoreN) is what is contributed by N 
inlinks. The current formula omits sum(distributedScoreM); it also doesn't 
provide any way to "sponsor" pages with no incoming links so that they won't 
go broke (the concept of "virtual nodes" I mentioned above).
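
A minimal sketch of the update proposed above, in Java to match the snippet quoted from the issue. The class and method names (ScoreUpdateSketch, newScore, sum) are hypothetical and are not taken from the Nutch code base; the two arrays simply stand in for the per-outlink and per-inlink contributions.

```java
// Hypothetical illustration only; not the actual Nutch scoring code.
class ScoreUpdateSketch {

    /** Adds up a list of score contributions. */
    static float sum(float[] values) {
        float total = 0.0f;
        for (float v : values) {
            total += v;
        }
        return total;
    }

    /**
     * newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)
     *
     * @param initialScore score from previous iterations (or the injected score)
     * @param distributed  what the page handed out to each of its M outlinks
     * @param incoming     what each of the N inlinks contributed
     */
    static float newScore(float initialScore, float[] distributed, float[] incoming) {
        return initialScore - sum(distributed) + sum(incoming);
    }

    public static void main(String[] args) {
        // A page with two outlinks and two inlinks.
        System.out.println(newScore(1.0f,
                new float[] {0.25f, 0.25f},
                new float[] {0.5f, 0.125f}));

        // A page that gave away its whole score and has no known inlinks.
        System.out.println(newScore(1.0f,
                new float[] {0.5f, 0.5f},
                new float[0]));
    }
}
```

The second call shows the case the "sponsoring" idea is meant to address: a page that distributes everything and receives nothing ends up with a score of zero under this formula.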

Re: summing logs: yes, but then why use "sqrt(opic) * docSimilarity" instead of 
"log(opic * docSimilarity)"?

> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
>          Key: NUTCH-267
>          URL: http://issues.apache.org/jira/browse/NUTCH-267
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>     Versions: 0.8-dev
>     Reporter: Chris Schneider
>     Priority: Minor
>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
> indexer.boost.by.link.count was true, the indexer boost value was scaled 
> based on the log of the # of inbound links:
>     if (boostByLinkCount)
>       res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters). 
> Instead, the boost value is just the square root (or some other scorePower) 
> of the page score. Shouldn't the invertlinks command, which creates the 
> linkdb, have some effect on the boost value calculated during indexing 
> (either via the OPICScoringFilter or some other built-in filter)?
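
To make the two behaviours described in the issue easier to compare, here is a small Java sketch. It is an illustration under assumptions: the class and method names (BoostSketch, boostByLinkCount, boostByScore) are hypothetical, the first method only wraps the expression quoted in the issue, and the second models "the square root (or some other scorePower) of the page score" as Math.pow(score, scorePower). It does not reproduce how the indexer or OPICScoringFilter is actually written.

```java
// Hypothetical illustration only; not the actual Nutch indexer code.
class BoostSketch {

    /** Older behaviour quoted in the issue: scale by the log of the inbound link count. */
    static float boostByLinkCount(float res, int linkCount) {
        return res * (float) Math.log(Math.E + linkCount);
    }

    /** Score-based boost: the page score raised to scorePower (0.5 gives the square root). */
    static float boostByScore(float score, float scorePower) {
        return (float) Math.pow(score, scorePower);
    }

    public static void main(String[] args) {
        // The link-count boost grows slowly with the number of inlinks.
        for (int linkCount : new int[] {0, 10, 100}) {
            System.out.println(linkCount + " inlinks: " + boostByLinkCount(1.0f, linkCount));
        }
        // The score-based boost depends only on the page score, not on the linkdb.
        System.out.println("score 4.0: " + boostByScore(4.0f, 0.5f));
    }
}
```

In this sketch the second method never sees a link count, which is the gap the issue title points at.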
