[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598468#comment-14598468 ]
Sebastian Nagel commented on NUTCH-2038:
----------------------------------------
bq. From what I understand the problem is that a url filter in nutch has a very
simple interface (has no provision for content) and is only "fired" in the
generator step.
Yep, totally unaware of content. It's "fired" not only in the generator but in
multiple (configurable) places, from URL injection up to indexing.
bq. if we all agree to let it be a url filter (and that's completely up to you
guys)
Right, that's the first decision to be made. Sorry about that ;) Because we'll
either change an existing interface (including all existing implementations!)
or add a new one, causing changes in the core, it would be a good idea to spend
some time on discussion:
- how to make this interface extensible for related use cases (e.g., accepting
URLs based on relevant terms in the anchor text)
- from where to call the new methods (ParseSegment is not compatible with a
parsing fetcher)
- how to keep too much specific workflow logic ("if content is irrelevant but
URL contains relevant terms") away from the core to keep it lean
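
To make the discussion points above a bit more concrete, here is a minimal
sketch of what a content-aware filter could look like. Everything in it is an
assumption for illustration only: the names ContentAwareUrlFilter and
HotWordUrlFilter are hypothetical and not part of Nutch, the boolean
parentRelevant merely stands in for whatever verdict a classifier would give
for the page the link was found on, and "return null to reject" just mirrors
the usual URL filter convention. The workflow rule ("content is irrelevant but
URL contains relevant terms") lives entirely in the implementation, not in the
caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical extension of the plain "URL string in, URL string out" filter. */
interface ContentAwareUrlFilter {
    /**
     * @param url            the outlink to check
     * @param anchorText     anchor text of the link, may be null
     * @param parentRelevant assumed classifier verdict for the page containing the link
     * @return the URL if accepted, or null if rejected
     */
    String filter(String url, String anchorText, boolean parentRelevant);
}

/** Keeps all outlinks of relevant pages; otherwise only those matching a hot word. */
final class HotWordUrlFilter implements ContentAwareUrlFilter {
    private final List<String> hotWords = new ArrayList<>();

    HotWordUrlFilter(List<String> words) {
        // Normalize once so matching is case-insensitive.
        for (String w : words) {
            hotWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public String filter(String url, String anchorText, boolean parentRelevant) {
        if (parentRelevant) {
            return url; // links from relevant pages always pass
        }
        String anchor = (anchorText == null) ? "" : anchorText;
        String haystack = (url + " " + anchor).toLowerCase(Locale.ROOT);
        for (String w : hotWords) {
            if (haystack.contains(w)) {
                return url; // irrelevant page, but URL or anchor has a hot word
            }
        }
        return null; // rejected
    }
}

class HotWordFilterDemo {
    public static void main(String[] args) {
        ContentAwareUrlFilter f = new HotWordUrlFilter(Arrays.asList("nutch", "crawler"));
        System.out.println(f.filter("http://example.org/about", "About us", true));
        System.out.println(f.filter("http://example.org/about", "About us", false));
        System.out.println(f.filter("http://example.org/docs", "Nutch tutorial", false));
    }
}
```

Passing the anchor text and the relevance verdict as plain arguments keeps the
sketch independent of where it would be called from, which is exactly the open
question with ParseSegment versus a parsing fetcher.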
> Naive Bayes classifier based url filter
> ---------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A url filter that will filter out the urls (after the parsing stage, will
> keep only those urls that contain some "hot words" provided again in a list.)
> from those pages that are classified irrelevant by the classifier.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)