[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598468#comment-14598468
 ] 

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

bq. From what I understand the problem is that a url filter in nutch has a very 
simple interface (has no provision for content) and is only "fired" in the 
generator step.
Yep, it is totally unaware of content. It's "fired" not only in the generator 
but in multiple (configurable) places, from URL injection up to indexing.
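
To make the "totally unaware of content" point concrete, here is a minimal, self-contained sketch of that contract: a filter receives only the URL string and returns either a URL (keep) or null (reject). `SimpleUrlFilter`, `applyAll` and `demoChain` are hypothetical stand-ins written for this illustration, not the actual Nutch classes, and the chaining shown is an assumption about how several filters could be combined.

```java
import java.util.Arrays;
import java.util.List;

public class UrlFilterSketch {

    /** Simplified stand-in for a URL filter: it sees only the URL string. */
    interface SimpleUrlFilter {
        /** Returns the URL to keep it, or null to reject it. */
        String filter(String url);
    }

    /** Runs a URL through the filters in order; the first null rejects it. */
    static String applyAll(List<SimpleUrlFilter> filters, String url) {
        String current = url;
        for (SimpleUrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    /** Two toy filters, both deciding purely on the URL string. */
    static List<SimpleUrlFilter> demoChain() {
        SimpleUrlFilter httpOnly = u -> u.startsWith("http") ? u : null;
        SimpleUrlFilter noImages = u -> u.endsWith(".jpg") ? null : u;
        return Arrays.asList(httpOnly, noImages);
    }

    public static void main(String[] args) {
        List<SimpleUrlFilter> chain = demoChain();
        System.out.println(applyAll(chain, "http://example.org/page.html"));
        System.out.println(applyAll(chain, "http://example.org/logo.jpg"));
    }
}
```

Nothing in this signature can carry page content or a classifier verdict, which is why the feature needs either a changed interface or a new one.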

bq. if we all agree to let it be a url filter (and that's completely up to you 
guys)
Right, that's the first decision to be made. Sorry about that ;) Because we'll 
either change an existing interface (including all existing implementations!) 
or add a new one causing changes in the core, it would be a good idea to spend 
some time on discussion:
- how to make this interface extensible for related use cases (e.g., accepting 
URLs based on relevant terms in the anchor text)
- from where to call the new methods (ParseSegment is not compatible with a 
parsing fetcher)
- how to keep too much specific workflow logic ("if content is irrelevant but 
URL contains relevant terms") away from the core to keep it lean
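
As a starting point for that discussion, here is one possible shape such an extension could take, sketched in plain Java. Everything in it is an assumption made for illustration: `OutlinkContext`, `ContextUrlFilter` and `HotWordFilter` are hypothetical names, not existing Nutch APIs, and the classifier verdict is reduced to a boolean. The idea is that the core only passes a context object to the plugin, while the rule "content is irrelevant but URL or anchor text contains relevant terms" lives entirely inside the plugin.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ContentAwareFilterSketch {

    /** Hypothetical carrier for what is known about an outlink at parse time. */
    static final class OutlinkContext {
        final String url;
        final String anchorText;
        final boolean parentRelevant; // verdict of the classifier on the parent page

        OutlinkContext(String url, String anchorText, boolean parentRelevant) {
            this.url = url;
            this.anchorText = anchorText == null ? "" : anchorText;
            this.parentRelevant = parentRelevant;
        }
    }

    /** Hypothetical extension point: same keep-or-null contract, richer input. */
    interface ContextUrlFilter {
        String filter(OutlinkContext ctx);
    }

    /** Plugin-side rule: keep links from relevant pages, else require a hot word. */
    static final class HotWordFilter implements ContextUrlFilter {
        private final List<String> hotWords;

        HotWordFilter(List<String> hotWords) {
            this.hotWords = hotWords;
        }

        @Override
        public String filter(OutlinkContext ctx) {
            if (ctx.parentRelevant) {
                return ctx.url;
            }
            String haystack = (ctx.url + " " + ctx.anchorText).toLowerCase(Locale.ROOT);
            for (String word : hotWords) {
                if (haystack.contains(word.toLowerCase(Locale.ROOT))) {
                    return ctx.url;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        ContextUrlFilter filter = new HotWordFilter(Arrays.asList("nutch", "crawler"));
        System.out.println(filter.filter(
            new OutlinkContext("http://example.org/b", "Nutch tutorial", false)));
        System.out.println(filter.filter(
            new OutlinkContext("http://example.org/c", "contact", false)));
    }
}
```

Because the caller only builds the context and invokes `filter`, the same call could in principle be made from either a separate parse step or a parsing fetcher; where exactly to make it is still the open question from the second bullet.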



> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A url filter that, after the parsing stage, will filter out the urls from 
> those pages that are classified as irrelevant by the classifier, keeping only 
> those urls that contain some "hot words", which are again provided in a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
