[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598724#comment-14598724
]
Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------
Yeah, so here's the deal. I think I can implement a SimilarityUrlFilterPlugin
that simply calls Tika per URL. Tika is extremely fast, and I could compute, e.g.,
Jaccard similarity over extracted text features (e.g., n-gramming and/or TF-IDF,
or some other summarization metric) and/or over metadata features.
This is basically what we did in my 572 class.
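
To make the Jaccard idea concrete, here is a minimal sketch in plain Java. It
assumes the page text has already been extracted (e.g., by Tika) and is passed
in as a String. The class name JaccardSketch, the helper names, and the choice
of word bigrams as the features are illustrative assumptions only, not the
actual plugin code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: Jaccard similarity over word n-grams of extracted text.
class JaccardSketch {

    // Split text on whitespace and collect the set of word n-grams.
    // Text shorter than n words yields a single feature holding all its words.
    static Set<String> shingles(String text, int n) {
        Set<String> result = new HashSet<>();
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length < n) {
            result.add(String.join(" ", tokens));
            return result;
        }
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    // Jaccard similarity: |A intersect B| / |A union B|.
    // By convention in this sketch, two empty sets score 0.0.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Made-up example strings standing in for extracted page text.
        Set<String> seed = shingles("open source web crawler", 2);
        Set<String> candidate = shingles("open source web search", 2);
        System.out.println("similarity = " + jaccard(seed, candidate));
    }
}
```

A URL filter built on this would compare each candidate page against one or
more seed documents and keep the URL only if the score clears a threshold; the
threshold value would need tuning and is not specified here.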
Asitang's idea about doing this with a ParseFilter in parse-tika is neat. I
think this issue should be updated to reflect that, and I'll open a separate one
for my Tika-based SimilarityUrlFilter. As long as it's a plugin and someone
is willing to support it as a PMC member (i.e., me, etc.), there is no reason not
to push forward with it. Asitang can move forward with his ParseFilter, and I
(and others) can review what he produces.
> Naive Bayes classifier based url filter
> ---------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A URL filter that will filter out the URLs from pages that are classified as
> irrelevant by the classifier. (It runs after the parsing stage and will keep
> only those URLs that contain some "hot words", which are also provided in a list.)
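
The hot-word rule in the description above can be sketched roughly as follows.
This is one reading of the description, not the attached patch: the classifier
is reduced to a boolean flag, outlinks from relevant pages are all kept, and
outlinks from irrelevant pages survive only if the URL string contains a hot
word. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the hot-word rule; the classifier itself is out of scope.
class HotWordOutlinkSketch {

    // True if the URL contains any hot word as a case-insensitive substring.
    static boolean containsHotWord(String url, List<String> hotWords) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : hotWords) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Relevant page: keep every outlink.
    // Irrelevant page: keep only outlinks whose URL contains a hot word.
    static List<String> filterOutlinks(boolean pageIsRelevant,
                                       List<String> outlinks,
                                       List<String> hotWords) {
        if (pageIsRelevant) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (containsHotWord(url, hotWords)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up URLs and hot words, for illustration only.
        List<String> outlinks = Arrays.asList(
            "http://example.com/crawler/docs",
            "http://example.com/about");
        List<String> hotWords = Arrays.asList("crawler");
        System.out.println(filterOutlinks(false, outlinks, hotWords));
    }
}
```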