[
https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sujen Shah updated NUTCH-2047:
------------------------------
Attachment: part-00000
This file is a dump of the top 1000 URLs.
The model file contained information related to robotics, taken from a Wikipedia
article, and the seed list was the homepage of CMU's Robotics Institute.
> Improvements to the relevance scoring plugin
> --------------------------------------------
>
> Key: NUTCH-2047
> URL: https://issues.apache.org/jira/browse/NUTCH-2047
> Project: Nutch
> Issue Type: Improvement
> Components: scoring
> Reporter: Sujen Shah
> Labels: memex
> Fix For: 1.11
>
> Attachments: part-00000
>
>
> To discuss the results of, and improvements to, the scoring-similarity
> plugin, which uses the cosine similarity model.
> Currently, each outlink is assigned the same score as its parent URL.
> This means that an irrelevant URL (with a relevant parent) is fetched for
> one more round before it gets a lower score and is filtered out. So we
> require one additional fetch/parse to score these irrelevant URLs (from
> relevant parents) lower.
> Any suggestions on this are appreciated.
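
To make the one-round lag described above concrete, here is a minimal sketch in
Python. It is not the plugin's code: the function names, the whitespace
tokenization, and the three sample texts are all assumptions made for
illustration. It only shows how an outlink that inherits its parent's cosine
similarity score keeps that score until its own content has been fetched and
parsed.

```python
import math
from collections import Counter


def term_vector(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, count terms.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared terms, divided by the product of vector norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up stand-ins for the model file, a relevant parent page, and an
# irrelevant page that the parent links to.
model = term_vector("robot robotics sensor actuator control")
parent_page = term_vector("robotics institute robot control research")
outlink_page = term_vector("campus parking permits and visitor information")

parent_score = cosine_similarity(model, parent_page)

# Round N: the outlink has not been fetched yet, so it carries the
# parent's score and is selected for fetching alongside relevant URLs.
inherited_score = parent_score

# Round N+1: only after fetch/parse can the outlink be scored on its own text.
own_score = cosine_similarity(model, outlink_page)

print(f"parent={parent_score:.2f} inherited={inherited_score:.2f} own={own_score:.2f}")
```

In this toy data the irrelevant outlink shares no terms with the model, so its
own score is zero, but that is only known after the extra fetch/parse.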