[jira] [Commented] (NUTCH-2039) Relevance based scoring filter

Sujen Shah (JIRA) Mon, 15 Jun 2015 10:32:44 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586355#comment-14586355
 ]


Sujen Shah commented on NUTCH-2039:
-----------------------------------

Thank you [~chrismattmann] and [~wastl-nagel] for your comments. 

* I will extend AbstractScoringFilter, as I do not need all the methods. I 
score a URL only after it has been parsed, so I do nothing during the inject 
or generate phases.

* [~wastl-nagel], yes, the gold-standard file needs to be parsed only once. I 
did not quite follow this part:
bq. "This could be done from SimilarityScoringFilter.setConf(conf) the 
resulting DocumentVector is then cached."
Is SimilarityScoringFilter.setConf(conf) run only once per job, or can it run 
multiple times? If it runs only once, I could compute the DocumentVector there 
and store it in the conf, right?

* Will change the code to read the files from the filesystem. 
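
A minimal sketch of the caching idea quoted above, written in plain Java so it runs without Nutch or Hadoop on the classpath. Everything in it is an assumption for illustration: the class GoldStandardCache, the property name "gold.standard.text", and the use of a Map in place of the real Configuration object are hypothetical. The point is that the vector lives in a field of the filter and is built at most once, however often setConf is called.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a scoring filter; this is not the Nutch API. */
class GoldStandardCache {
    // Number of times the gold-standard text was actually parsed.
    int parseCount = 0;

    // Cached term-frequency vector, built on the first setConf call only.
    private Map<String, Integer> goldVector;

    /** Assumed analogue of setConf(conf); safe to call repeatedly. */
    void setConf(Map<String, String> conf) {
        if (goldVector != null) {
            return; // already built, so skip re-parsing
        }
        // "gold.standard.text" is an invented property name.
        String text = conf.getOrDefault("gold.standard.text", "");
        goldVector = termFrequencies(text);
        parseCount++;
    }

    Map<String, Integer> getGoldVector() {
        return goldVector;
    }

    /** Lower-cases the text, splits on non-word characters, counts terms. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }
}
```

Keeping the vector in a field rather than in the conf avoids serializing it, which matters if the configuration object only holds string properties, as I believe the Hadoop one does.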

> Relevance based scoring filter
> ------------------------------
>
>                 Key: NUTCH-2039
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2039
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Sujen Shah
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A ScoringFilter plugin that uses a similarity measure to calculate the 
> similarity between a given page(gold standard) and the currently parsed page. 
> The score obtained from this similarity is then distributed to its outlinks. 
> This filter aims to focus the crawler to crawl/explore relevant pages. 
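
The description above can be illustrated with a small sketch. The issue does not say which similarity measure is used or how the score is divided among outlinks, so this example assumes cosine similarity over raw term counts and assumes every outlink simply inherits the page's score. The class and method names (SimilaritySketch, cosine, scoreOutlinks) are hypothetical and are not part of Nutch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the plugin's actual implementation. */
class SimilaritySketch {

    /** Lower-cases the text, splits on non-word characters, counts terms. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity of two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int va = e.getValue();
            normA += (double) va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += (double) va * vb;
            }
        }
        for (int vb : b.values()) {
            normB += (double) vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumption: each outlink inherits the parent page's score unchanged. */
    static Map<String, Double> scoreOutlinks(double pageScore, List<String> outlinks) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String url : outlinks) {
            scores.put(url, pageScore);
        }
        return scores;
    }
}
```

With this scheme, outlinks of pages that resemble the gold standard get higher scores, which is one way the crawl could be steered toward relevant pages. Dividing the score equally among the outlinks would be an equally plausible reading of "distributed".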



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
