[
https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2627.
------------------------------------
Resolution: Implemented
Assignee: Sebastian Nagel
Committed/merged.
> Fetcher to optionally filter URLs
> ---------------------------------
>
> Key: NUTCH-2627
> URL: https://issues.apache.org/jira/browse/NUTCH-2627
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
>
> When running a large web crawl, it happens that a webadmin asks to
> immediately stop crawling a certain domain. The default Nutch workflow
> applies URL filters only to seeds and outlinks. Applying filters during fetch
> list generation is expensive with a large CrawlDb (fetch lists are usually
> much shorter). Letting the fetcher optionally filter URLs would make it
> possible to apply changed filter rules to the next fetcher job launched,
> even if the segment has already been created (especially if multiple
> segments are generated in one turn).
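
For illustration only, the idea in the description above can be sketched as a
small self-contained Java program. This is not the committed Nutch code: the
class FetchTimeFilterSketch, the UrlFilter interface, the excluding() and
selectForFetch() helpers, the filterEnabled flag and the blocked.example
domain are all hypothetical names made up for this sketch. The only thing
borrowed from Nutch is the convention that a URL filter returns the URL when
it is accepted and null when it is rejected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FetchTimeFilterSketch {

    /** Hypothetical stand-in for a URL filter: returns the URL if accepted, null if rejected. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Builds a filter that rejects any URL matching one of the exclusion patterns. */
    static UrlFilter excluding(List<Pattern> exclusions) {
        return url -> {
            for (Pattern p : exclusions) {
                if (p.matcher(url).find()) {
                    return null; // rejected by a (possibly recently changed) rule
                }
            }
            return url; // accepted unchanged
        };
    }

    /**
     * Walks an already generated fetch list and keeps only the URLs that pass the
     * current filter rules. With filterEnabled == false the list is returned as is,
     * which models the "optional" part of the proposal.
     */
    static List<String> selectForFetch(List<String> fetchList, UrlFilter filter,
                                       boolean filterEnabled) {
        List<String> selected = new ArrayList<>();
        for (String url : fetchList) {
            if (!filterEnabled || filter.filter(url) != null) {
                selected.add(url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Fetch list of a segment created before the webadmin's request arrived.
        List<String> fetchList = Arrays.asList(
            "https://ok.example/a",
            "https://blocked.example/page",
            "https://www.blocked.example/");

        // Rule added afterwards: exclude blocked.example and its subdomains.
        UrlFilter filter = excluding(Arrays.asList(
            Pattern.compile("^https?://([^/]+\\.)?blocked\\.example(/|$)")));

        System.out.println(selectForFetch(fetchList, filter, true));
        System.out.println(selectForFetch(fetchList, filter, false));
    }
}
```

The point of the sketch is that the check runs over the (short) fetch list of
one segment rather than over the whole CrawlDb, so a changed rule takes effect
at the next fetcher launch without regenerating segments.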