[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595470#comment-14595470
 ] 

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

You're right, the scoring filter interface is complex and not easy to 
understand. But scoring filters are powerful and can do a lot of "magic" beyond 
pure "scoring", e.g., limiting the crawl by link depth or focused crawling. 
The ScoringFilter interface is complex because it must fit into the Nutch 
workflow. In 2.x the interface is simpler because the workflow and the 
underlying data structures are simpler (one web table vs. segments with 
multiple subdirectories in 1.x). Plugins should be lightweight in their 
resource usage, and it's surely not ideal if they run MapReduce jobs 
(findDatumForUrl must do this in 1.x) or fetch content again via Tika.

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out, after the parsing stage, the URLs from 
> pages that the classifier classifies as irrelevant, keeping only those URLs 
> that contain some "hot words" provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
