[ https://issues.apache.org/jira/browse/NUTCH-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588443#comment-14588443 ]
Lewis John McGibbney commented on NUTCH-2039:
---------------------------------------------
Good work, I am +1 for this patch.
Some possible future improvements are:
* a wiki page explaining exactly what the cosine similarity measure entails,
which could be referenced from a simple README.md in the plugin directory.
* abstracting the core similarity functionality behind interfaces, as there are
many different similarity metrics which could be used. This would mean that
others could contribute additional similarity algorithms for pages.
Excellent work. I will commit at EoB unless there are objections.
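
For anyone following along, here is a rough sketch of what a cosine similarity
score between a gold standard page and a parsed page could look like. It is
written in Python for brevity and is not the patch's Java code. The function
names, the simple regex tokenizer, and the choice to hand each outlink the
page's full score are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split on non-alphanumerics, and count each term.

    Assumption: a real plugin would likely use a proper tokenizer and
    stop-word handling; this regex split is only for illustration.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors.

    Returns a value between 0.0 (no shared terms) and 1.0 (same direction).
    """
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty page has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def score_outlinks(gold_text, page_text, outlinks):
    """Score the parsed page against the gold standard page.

    Assumption: every outlink simply receives the page's score; the actual
    distribution rule used by the patch may differ.
    """
    score = cosine_similarity(term_frequencies(gold_text),
                              term_frequencies(page_text))
    return {url: score for url in outlinks}
```

Putting the metric behind a single function such as cosine_similarity is also
one way to picture the second point above: another metric with the same
two-vector signature could be swapped in without touching score_outlinks.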
> Relevance based scoring filter
> ------------------------------
>
> Key: NUTCH-2039
> URL: https://issues.apache.org/jira/browse/NUTCH-2039
> Project: Nutch
> Issue Type: New Feature
> Reporter: Sujen Shah
> Assignee: Sujen Shah
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A ScoringFilter plugin that uses a similarity measure to calculate the
> similarity between a given page (the gold standard) and the currently parsed
> page. The score obtained from this similarity is then distributed to the
> parsed page's outlinks. This filter aims to focus the crawler on exploring
> relevant pages.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)