[ http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379072 ]
Andrzej Bialecki commented on NUTCH-267:
-----------------------------------------
Hmm, resetting the score to 0 is also dubious - it's as if we didn't want the
page to be re-crawled if we can't find any inlinks to it... I believe the score
should be reset to the following value:
newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)
where initialScore is the score we got from previous iterations (or
injectedScore), sum(distributedScoreM) is what we have distributed to the M
outlinks from that page, and sum(incomingScoreN) is what is contributed by the
N inlinks. The current formula omits sum(distributedScoreM); it also doesn't
provide any way to "sponsor" pages with no incoming links so that they won't
go broke (the concept of "virtual nodes" I mentioned above).
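To make the proposed reset concrete, here is a minimal sketch of the formula
above in plain Java. The class and method names (ScoreReset, newScore, sum) are
hypothetical and are not part of Nutch; the lists simply stand in for the
scores distributed to outlinks and received from inlinks. It does not attempt
the "virtual nodes" idea.

```java
import java.util.List;

/** Hypothetical helper illustrating the proposed reset; not Nutch code. */
final class ScoreReset {
    private ScoreReset() {}

    /** Adds up a list of score contributions. */
    static float sum(List<Float> values) {
        float total = 0.0f;
        for (float v : values) {
            total += v;
        }
        return total;
    }

    /**
     * newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN)
     *
     * @param initialScore score from previous iterations (or the injected score)
     * @param distributed  scores handed out to the page's M outlinks
     * @param incoming     scores contributed by the page's N inlinks
     */
    static float newScore(float initialScore,
                          List<Float> distributed,
                          List<Float> incoming) {
        return initialScore - sum(distributed) + sum(incoming);
    }

    public static void main(String[] args) {
        // Page with two outlinks and two inlinks: 1.0 - 0.5 + 0.625
        System.out.println(newScore(1.0f,
                List.of(0.25f, 0.25f), List.of(0.5f, 0.125f))); // prints 1.125

        // Page with no known inlinks keeps what it did not distribute,
        // instead of being reset to 0: 1.0 - 0.5 + 0
        System.out.println(newScore(1.0f,
                List.of(0.25f, 0.25f), List.<Float>of()));      // prints 0.5
    }
}
```

With this formula a page without inlinks ends up with a reduced but non-zero
score, so it can still be picked for re-crawl.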
Re: summing logs: yes, but then why use "sqrt(opic) * docSimilarity" instead of
"log(opic * docSimilarity)"?
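For comparison, the two expressions can be put side by side in a small Java
sketch. The names BoostCompare, sqrtBoost and logBoost are hypothetical, and
the inputs are arbitrary illustrative values, not measurements from a real
crawl.

```java
/** Hypothetical comparison of the two boost expressions; not Nutch code. */
final class BoostCompare {
    private BoostCompare() {}

    /** sqrt(opic) * docSimilarity */
    static float sqrtBoost(float opic, float docSimilarity) {
        return (float) Math.sqrt(opic) * docSimilarity;
    }

    /** log(opic * docSimilarity), which equals log(opic) + log(docSimilarity) */
    static float logBoost(float opic, float docSimilarity) {
        return (float) Math.log(opic * docSimilarity);
    }

    public static void main(String[] args) {
        float opic = 4.0f;
        float docSimilarity = 0.5f;
        System.out.println(sqrtBoost(opic, docSimilarity)); // sqrt(4) * 0.5
        System.out.println(logBoost(opic, docSimilarity));  // log(4 * 0.5)
    }
}
```

One practical difference: the log form is negative whenever opic *
docSimilarity is below 1 and is negative infinity when the product is 0, while
the sqrt form stays non-negative for non-negative inputs.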
> Indexer doesn't consider linkdb when calculating boost value
> ------------------------------------------------------------
>
> Key: NUTCH-267
> URL: http://issues.apache.org/jira/browse/NUTCH-267
> Project: Nutch
> Type: Bug
> Components: indexer
> Versions: 0.8-dev
> Reporter: Chris Schneider
> Priority: Minor
>
> Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if
> indexer.boost.by.link.count was true, the indexer boost value was scaled
> based on the log of the # of inbound links:
>   if (boostByLinkCount)
>     res *= (float)Math.log(Math.E + linkCount);
> This is no longer true (even before Andrzej implemented scoring filters).
> Instead, the boost value is just the square root (or some other scorePower)
> of the page score. Shouldn't the invertlinks command, which creates the
> linkdb, have some effect on the boost value calculated during indexing
> (either via the OPICScoringFilter or some other built-in filter)?
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers