[ http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370356 ]

Andrzej Bialecki  commented on NUTCH-230:
-----------------------------------------

Hmmm, this is a deeply philosophical question... Should you spread out the OPIC 
score to all links that a page sports, or just to the links that you are 
interested in? Which option is closer to the real meaning of the OPIC score?

Let's consider this argument: the OPIC score is a "cash value", and it 
represents an intrinsic value of a page, or its usefulness. If a page contains 
useless links, it should lose some "cash" over those links, i.e. because of 
them the value of the page and its outlinks should be lowered. That's the 
effect we achieve in the current code.

On the other hand, if we were to change the calculation the way you propose, 
pages with a lot of bad links would heavily promote those few good links that 
they have. This seems to contradict the idea of OPIC, which is that "good" 
pages should promote all outlinked pages. If we follow your proposal, bad 
pages would promote more aggressively than good pages...
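
To make the trade-off concrete, here is a small sketch of the arithmetic. It is not Nutch code: the class and method names are hypothetical, and it assumes a page score of 1.0 and ten outlinks per page purely for illustration. With those numbers, a page where only 2 of 10 links pass the filters gives each surviving link 0.1 under the current scheme and 0.5 under the proposed one, while a page where all 10 links pass gives 0.1 per link either way.

```java
// Hypothetical helper, not part of Nutch: compares the two ways of
// splitting a page's OPIC "cash" among its outlinks.
final class OpicSplit {

    // Current behaviour described in the issue: divide by ALL outlinks,
    // so the share belonging to filtered links is simply dropped.
    static float perLinkByTotal(float pageScore, int totalLinks) {
        return totalLinks == 0 ? 0.0f : pageScore / totalLinks;
    }

    // Proposed behaviour: divide only by the links that survive filtering,
    // so the whole score is passed on to the valid links.
    static float perLinkByValid(float pageScore, int validLinks) {
        return validLinks == 0 ? 0.0f : pageScore / validLinks;
    }

    public static void main(String[] args) {
        float score = 1.0f; // assumed page score, for illustration only

        // "Bad" page: 10 outlinks, only 2 pass the filters.
        System.out.println("bad page, by total:  " + perLinkByTotal(score, 10));
        System.out.println("bad page, by valid:  " + perLinkByValid(score, 2));

        // "Good" page: 10 outlinks, all pass; both schemes agree.
        System.out.println("good page, by total: " + perLinkByTotal(score, 10));
        System.out.println("good page, by valid: " + perLinkByValid(score, 10));
    }
}
```

Under these assumed numbers the "bad" page hands each of its valid links five times what the "good" page does, which is the effect I am worried about.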

> OPIC score for outlinks should be based on # of valid links, not total # of 
> links.
> ----------------------------------------------------------------------------------
>
>          Key: NUTCH-230
>          URL: http://issues.apache.org/jira/browse/NUTCH-230
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Ken Krugler
>     Priority: Minor
>
> In ParseOutputFormat.java, the write() method currently divides the page 
> score by the # of outlinks:
>           score /= links.length;
> It then loops over the links, and any that pass the normalize/filter gauntlet 
> get added to the crawl output.
> But this means that any filtered links result in some amount of the page's 
> OPIC score being "lost".
> For Nutch 0.7, I built a list of valid (post-filter) links, and then used 
> that to determine the per-link OPIC score, after which I iterated over the 
> list, adding entries to the crawl output.
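
For reference, the filter-first approach described in the issue could be sketched roughly as follows. This is only an illustration, not the actual ParseOutputFormat code or the reporter's 0.7 patch: the class name, the method, and the `isValid` predicate (standing in for the normalize/filter step) are all assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: collect the links that pass filtering first,
// then split the page score among only those links.
final class ValidLinkScorer {

    static List<Map.Entry<String, Float>> distribute(
            float pageScore, List<String> outlinks, Predicate<String> isValid) {

        // Pass 1: build the list of valid (post-filter) links.
        List<String> valid = new ArrayList<>();
        for (String url : outlinks) {
            if (isValid.test(url)) {
                valid.add(url);
            }
        }

        List<Map.Entry<String, Float>> output = new ArrayList<>();
        if (valid.isEmpty()) {
            return output; // nothing to emit, and no division by zero
        }

        // Pass 2: the per-link score is based on the number of valid links.
        float perLink = pageScore / valid.size();
        for (String url : valid) {
            output.add(new AbstractMap.SimpleImmutableEntry<>(url, perLink));
        }
        return output;
    }
}
```

The cost of this variant is an extra pass and a temporary list per page; the current code avoids both by dividing by `links.length` up front.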

