[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595470#comment-14595470 ]
Sebastian Nagel commented on NUTCH-2038:
----------------------------------------
The scoring filter interface is complex, you're right, and not easy to
understand. But scoring filters are powerful and can do a lot of "magic"
beyond pure scoring, e.g., limiting the crawl by link depth or focused
crawling. The ScoringFilter interface is complex because it must fit into the
Nutch workflow. In 2.x the interface is simpler because the workflow and the
underlying data structures are simpler (one web table vs. segments with
multiple subdirectories). Plugins should be lightweight in their use of
resources, and it's surely not ideal if they run MapReduce jobs
(findDatumForUrl must do this in 1.x) or fetch content again via Tika.
> Naive Bayes classifier based url filter
> ---------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A URL filter that runs after the parsing stage and filters out the URLs
> found on pages that the classifier marks as irrelevant, keeping only those
> URLs that contain one of the "hot words" supplied in a list.
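
The filtering rule in the issue description can be sketched in plain Java.
This is only an illustration under stated assumptions, not the NUTCH-2038
patch and not a Nutch plugin: the class and method names are hypothetical, the
classifier verdict is passed in as a boolean rather than computed by a Naive
Bayes model, and hot-word matching is assumed to be a case-insensitive
substring test.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical sketch of the "hot words" rule; names are illustrative only. */
class HotWordUrlFilter {

    /** Returns true if the URL contains at least one hot word (case-insensitive). */
    public static boolean containsHotWord(String url, List<String> hotWords) {
        String lower = url.toLowerCase(Locale.ROOT);
        return hotWords.stream()
                .anyMatch(word -> lower.contains(word.toLowerCase(Locale.ROOT)));
    }

    /**
     * Outlinks of a page classified as relevant pass through unchanged.
     * Outlinks of a page classified as irrelevant are kept only if they
     * contain a hot word.
     *
     * @param pageIsRelevant assumed to come from the classifier (not modeled here)
     */
    public static List<String> filterOutlinks(boolean pageIsRelevant,
                                              List<String> outlinks,
                                              List<String> hotWords) {
        if (pageIsRelevant) {
            return outlinks;
        }
        return outlinks.stream()
                .filter(url -> containsHotWord(url, hotWords))
                .collect(Collectors.toList());
    }
}
```

Taking the verdict as a parameter keeps the sketch independent of how the page
is classified, so the rule works on the outlinks already extracted by the
parser and needs neither a MapReduce job nor a second fetch of the content.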
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)