[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595388#comment-14595388
 ] 

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

That's what we were working on. My 572 class in Fall 2014 (and in Spring 
2015) implemented different versions of the above, and it worked OK. I figured 
I'd contribute it upstream to Nutch. Asitang was one of the students, so I 
thought we could do both approaches. Furthermore, I realize URL filters are 
supposed to be fast, but they also present an understandable workflow. We've 
always had people question scores, and they aren't as intuitive to me as 
"accept this URL (or not)", which to me is the basis of a domain-specific, or 
focused, crawler. Indirectly the score interface can also do this, I agree, but 
to me it's not as explicit as the URLFilter.
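
To make the "accept this URL (or not)" idea concrete, here is a minimal Java sketch of the rule in the issue description quoted below. It is not the contributed patch and it does not implement Nutch's plugin interface. The class name HotWordUrlFilterSketch, the two-argument filter method, and the Predicate standing in for the trained Naive Bayes model are all assumptions made for illustration. The only convention borrowed from URL filters is "return the URL to accept it, null to reject it".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/**
 * Sketch only: mimics the "return the URL to accept, null to reject"
 * convention without depending on Nutch, so it is self-contained.
 */
final class HotWordUrlFilterSketch {

  // Hypothetical stand-in for the trained Naive Bayes model:
  // returns true when the parsed page text is judged relevant.
  private final Predicate<String> pageIsRelevant;

  // Lower-cased "hot words"; a URL found on an irrelevant page
  // is kept only if it contains at least one of them.
  private final List<String> hotWords = new ArrayList<>();

  HotWordUrlFilterSketch(Predicate<String> pageIsRelevant, List<String> hotWords) {
    this.pageIsRelevant = pageIsRelevant;
    for (String word : hotWords) {
      this.hotWords.add(word.toLowerCase(Locale.ROOT));
    }
  }

  /** Returns the URL unchanged to accept it, or null to reject it. */
  String filter(String url, boolean parentPageRelevant) {
    if (parentPageRelevant) {
      return url; // links from relevant pages are always kept
    }
    String lower = url.toLowerCase(Locale.ROOT);
    for (String word : hotWords) {
      if (lower.contains(word)) {
        return url; // irrelevant page, but the URL mentions a hot word
      }
    }
    return null; // irrelevant page and no hot word: drop the URL
  }

  /** Classifies one parsed page, then applies the rule to each of its outlinks. */
  List<String> filterOutlinks(String pageText, List<String> outlinks) {
    boolean relevant = pageIsRelevant.test(pageText);
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (filter(url, relevant) != null) {
        kept.add(url);
      }
    }
    return kept;
  }
}
```

The decision for each URL is a plain yes or no that can be traced back to one classifier call and one word list, which is the explicitness argued for above. A score-based approach would instead fold the same signal into a number that still has to be compared against other URLs.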

> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that filters out the URLs from pages that the classifier 
> classifies as irrelevant. After the parsing stage, it keeps only those URLs 
> that contain some "hot words", again provided in a list.



