[ https://issues.apache.org/jira/browse/NUTCH-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujen Shah updated NUTCH-2047:
------------------------------
    Attachment: part-00000

This file is a dump of the top 1000 URLs. The model file contained robotics-related text taken from a Wikipedia article, and the seed list was the homepage of CMU's Robotics Institute.

> Improvements to the relevance scoring plugin
> --------------------------------------------
>
>                 Key: NUTCH-2047
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2047
>             Project: Nutch
>          Issue Type: Improvement
>          Components: scoring
>            Reporter: Sujen Shah
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: part-00000
>
>
> This issue is for discussing the results of, and improvements to, the
> scoring-similarity plugin, which uses the cosine similarity model.
> Currently, each outlink is assigned the same score as its parent URL.
> As a result, an irrelevant URL with a relevant parent is fetched for one
> more round before it receives a lower score and is filtered out. In other
> words, one additional fetch/parse is needed before irrelevant URLs from
> relevant parents are scored lower.
> Any suggestions on this are appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
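
For illustration, the behaviour described in the issue (outlinks inheriting the parent's cosine-similarity score until they are fetched and re-scored themselves) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the plugin's actual code (Nutch itself is written in Java): the function names `cosine_similarity` and `inherit_scores`, the whitespace tokenizer, and the sample texts and URLs are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts.

    Assumption: naive lowercase/whitespace tokenization, no stemming or
    stop-word removal. A real plugin would likely do more.
    """
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def inherit_scores(parent_score: float, outlinks: list[str]) -> dict[str, float]:
    """Behaviour described in the issue: every outlink gets the parent's score."""
    return {url: parent_score for url in outlinks}


if __name__ == "__main__":
    # Hypothetical model text and pages, for illustration only.
    model = "robot robotics autonomous sensor manipulator control"
    parent_text = "robotics institute research on autonomous robot control"
    child_text = "campus parking permits and visitor information"

    # Round N: the parent is scored against the model, outlinks inherit it.
    parent_score = cosine_similarity(parent_text, model)
    pending = inherit_scores(parent_score, ["http://example.org/parking"])
    print("inherited score:", pending["http://example.org/parking"])

    # Round N+1: only after fetching/parsing the child can it be re-scored.
    print("re-scored child:", cosine_similarity(child_text, model))
```

In this sketch the irrelevant child page keeps the parent's score until its own content has been fetched and parsed, which is the extra round the issue refers to.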