[ 
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah updated NUTCH-2047:
------------------------------
    Attachment: part-00000

This file is a dump of the top 1000 URLs. 
The model file contained information related to robotics, taken from a 
Wikipedia article, and the seed list was the homepage of CMU's Robotics 
Institute.

> Improvements to the relevance scoring plugin
> --------------------------------------------
>
>                 Key: NUTCH-2047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2047
>             Project: Nutch
>          Issue Type: Improvement
>          Components: scoring
>            Reporter: Sujen Shah
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: part-00000
>
>
> To discuss the results of, and improvements to, the scoring-similarity 
> plugin using the cosine similarity model. 
> Currently, each outlink is assigned the same score as its parent URL. 
> This means an irrelevant URL (with a relevant parent) is fetched for 
> one more round before it gets a lower score and is filtered out. So one 
> additional fetch/parse is required to score these irrelevant URLs (from 
> relevant parents) lower. 
> Any suggestions on this are appreciated. 
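
To make the behaviour described above concrete, here is a minimal sketch in 
Python. It is not the plugin's code: the tokenizer, the function names, the 
model text, the page texts and the example.org URL are all hypothetical and 
chosen only for illustration. It scores a parent page against a model with 
cosine similarity, hands that score to an outlink unchanged, and shows that 
the outlink's own score only appears once its content has been fetched and 
parsed.

```python
import math
from collections import Counter


def term_vector(text):
    """Map lowercased, whitespace-separated tokens to their counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distribute_to_outlinks(parent_score, outlinks):
    """Behaviour described in the issue: each outlink inherits the parent's score."""
    return {url: parent_score for url in outlinks}


# Hypothetical model text and page contents (assumptions, not real crawl data).
model = term_vector("robot robotics autonomous sensor control")
parent = term_vector("robotics institute autonomous robot research")
irrelevant_child = term_vector("campus parking permits and visitor maps")

# Round N: the parent is parsed and scored; its outlink inherits that score.
parent_score = cosine_similarity(model, parent)
inherited = distribute_to_outlinks(parent_score, ["http://example.org/parking"])

# Round N+1: only after fetching and parsing the outlink is it scored itself.
rescored = cosine_similarity(model, irrelevant_child)

print(parent_score, inherited, rescored)
```

In this sketch the irrelevant page carries its parent's score through one 
full round, which is the extra fetch/parse the description refers to.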



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
