[
https://issues.apache.org/jira/browse/NUTCH-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586355#comment-14586355
]
Sujen Shah commented on NUTCH-2039:
-----------------------------------
Thank you [~chrismattmann] and [~wastl-nagel] for your comments.
* I will extend AbstractScoringFilter, since I do not need all of the methods.
I score a URL only after it has been parsed, so I do nothing during the
inject or generate phases.
* [~wastl-nagel], yes, the gold-standard file needs to be parsed only once. I
did not quite follow this part:
bq. "This could be done from SimilarityScoringFilter.setConf(conf) the
resulting DocumentVector is then cached."
Is SimilarityScoringFilter.setConf(conf) run only once during each job, or is
it run multiple times? If it is run only once, then I could compute the
DocumentVector there and store it in the conf, right?
* Will change the code to read the files from the filesystem.
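To make the caching idea from the quoted suggestion concrete, here is a rough, standalone sketch. It is not the plugin code: CachedGoldStandardFilter, the "gold.standard.text" key, and the term-frequency map standing in for DocumentVector are all hypothetical, and a plain Map stands in for the Hadoop Configuration. The sketch does not settle how often setConf is actually called; the null check simply makes a repeated call on the same instance cheap.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a scoring filter that is configured through setConf(...). */
class CachedGoldStandardFilter {
    // Counts how often the gold-standard text is really parsed (illustration only).
    int parseCount = 0;

    // Cached term-frequency vector of the gold-standard page; null until first computed.
    private Map<String, Integer> goldVector;

    /** Assumption: conf is a plain map here; the real filter would receive a Configuration. */
    void setConf(Map<String, String> conf) {
        if (goldVector != null) {
            return; // already parsed on this instance, reuse the cached vector
        }
        String text = conf.getOrDefault("gold.standard.text", "");
        goldVector = toTermFrequencies(text);
        parseCount++;
    }

    Map<String, Integer> getGoldVector() {
        return goldVector;
    }

    /** Very simple tokenizer: lower-case, split on non-word characters, count terms. */
    static Map<String, Integer> toTermFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }
}
```

With this shape, the vector lives in a field of the filter instance rather than in the conf itself, which sidesteps the question of serializing it into string-valued configuration properties.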
> Relevance based scoring filter
> ------------------------------
>
> Key: NUTCH-2039
> URL: https://issues.apache.org/jira/browse/NUTCH-2039
> Project: Nutch
> Issue Type: New Feature
> Reporter: Sujen Shah
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A ScoringFilter plugin that uses a similarity measure to calculate the
> similarity between a given page (gold standard) and the currently parsed page.
> The score obtained from this similarity is then distributed to its outlinks.
> This filter aims to focus the crawler on crawling/exploring relevant pages.
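As an illustration of the description above, the two steps could look roughly like the sketch below. The issue does not say which similarity measure is used or how the score is distributed, so both choices here are assumptions: cosine similarity over term-frequency maps, and every outlink simply receiving the page's score. SimilaritySketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: cosine similarity plus a naive hand-off of the score to outlinks. */
class SimilaritySketch {

    /** Cosine similarity of two term-frequency vectors; returns 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int va = e.getValue();
            normA += va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += va * vb; // only terms present in both vectors contribute
            }
        }
        for (int vb : b.values()) {
            normB += vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumption: each outlink inherits the parent page's similarity score unchanged. */
    static Map<String, Double> scoreOutlinks(double pageScore, List<String> outlinks) {
        Map<String, Double> scores = new HashMap<>();
        for (String url : outlinks) {
            scores.put(url, pageScore);
        }
        return scores;
    }
}
```

A page whose terms closely match the gold standard scores near 1.0 and passes that score on to its outlinks, while a page with no shared terms scores 0.0.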
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)