[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598724#comment-14598724
 ] 

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

Yeah so here's the deal. I think I can implement a SimilarityUrlFilterPlugin 
that simply calls Tika per URL. Tika is extremely fast, and I could compute, 
e.g., Jaccard similarity on extracted text features (using something like 
n-gramming and/or TF-IDF or some other summarization metric) and/or metadata 
features. This is basically what we did in my 572 class.
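
To make the Jaccard idea concrete, here is a minimal sketch, not the plugin 
itself. It assumes the text has already been extracted (e.g., by Tika) and uses 
word-level bigrams as the features. The function names, the reference-text 
comparison, and the 0.3 threshold are all illustrative assumptions rather than 
anything decided in this issue.

```python
def word_ngrams(text, n=2):
    """Return the set of lowercased word n-grams (shingles) in text."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        # Text shorter than n words: treat the whole text as one feature.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_similar(text, reference_text, threshold=0.3, n=2):
    """Hypothetical filter decision: keep a page whose extracted text is
    at least `threshold` similar to a reference (known-relevant) text."""
    score = jaccard(word_ngrams(text, n), word_ngrams(reference_text, n))
    return score >= threshold
```

A URL filter built on this would accept a URL when is_similar() returns True 
for its extracted text. Swapping the bigram sets for TF-IDF-weighted terms or 
metadata fields would only change how the feature sets are built.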

Asitang's idea about doing this with a ParseFilter in parse-tika is neat. I 
think this issue should be updated to reflect that, and I'll open a separate 
one for my Tika-based SimilarityUrlFilter. As long as it's a plugin and someone 
is willing to support it as a PMC member (aka me, etc.), there is no reason not 
to push forward with it. Asitang can move forward with his ParseFilter, and I 
(and others) can review what he produces.

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs from pages 
> that the classifier marks as irrelevant, keeping only those URLs that contain 
> some "hot words" provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
