[ https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775296#comment-16775296 ]
Hudson commented on NUTCH-2627:
-------------------------------

FAILURE: Integrated in Jenkins build Nutch-trunk #3611 (See [https://builds.apache.org/job/Nutch-trunk/3611/])
NUTCH-2627 Fetcher to optionally filter URLs - filter and normalize URLs (snagel: [https://github.com/apache/nutch/commit/546237d4789b2df958752a722053a89c31c24597])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java

> Fetcher to optionally filter URLs
> ---------------------------------
>
>                 Key: NUTCH-2627
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2627
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> When running a large web crawl, a webadmin sometimes requests that crawling
> of a certain domain stop immediately. The default Nutch workflow applies URL
> filters only to seeds and outlinks. Applying filters during fetch list
> generation is expensive with a large CrawlDb (fetch lists are usually much
> shorter). Allowing the fetcher to optionally filter URLs would make it
> possible to apply changed filter rules to the next launched fetcher job even
> if the segment has already been created (especially if multiple segments are
> generated in one turn).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
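
For illustration only, the idea in the issue description (re-checking each fetch list entry against the current filter rules before it is queued for fetching) might be sketched as follows. This is not the code from the commit: the class `FetchListFilterSketch`, the methods `blockHost` and `selectForFetch`, and the `filterEnabled` flag are hypothetical names, and the filter is a plain-Java stand-in that assumes the convention of returning the URL to accept it and `null` to reject it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class FetchListFilterSketch {

    /**
     * Hypothetical stand-in for a URL filter rule: rejects (returns null for)
     * every URL on blockedHost or one of its subdomains, and returns all
     * other URLs unchanged.
     */
    public static UnaryOperator<String> blockHost(String blockedHost) {
        String blocked = blockedHost.toLowerCase(Locale.ROOT);
        return url -> {
            try {
                String host = URI.create(url).getHost();
                if (host == null) {
                    return null; // no usable host: reject
                }
                host = host.toLowerCase(Locale.ROOT);
                boolean isBlocked = host.equals(blocked) || host.endsWith("." + blocked);
                return isBlocked ? null : url;
            } catch (IllegalArgumentException e) {
                return null; // malformed URL: reject
            }
        };
    }

    /**
     * Passes the entries of an already generated fetch list on to the fetch
     * queue. When filterEnabled is true, each URL is checked against the
     * current filter first, so rules changed after the segment was created
     * still take effect. When it is false, the list is queued as-is.
     */
    public static List<String> selectForFetch(List<String> fetchList,
                                              boolean filterEnabled,
                                              UnaryOperator<String> filter) {
        List<String> queued = new ArrayList<>();
        for (String url : fetchList) {
            String kept = filterEnabled ? filter.apply(url) : url;
            if (kept != null) {
                queued.add(kept);
            }
        }
        return queued;
    }

    public static void main(String[] args) {
        // Fetch list generated before the webadmin's request arrived.
        List<String> fetchList = Arrays.asList(
                "https://example.com/a",
                "https://blocked.example.org/b",
                "https://example.net/c");

        // Rule added afterwards: stop crawling example.org.
        UnaryOperator<String> filter = blockHost("example.org");

        System.out.println(selectForFetch(fetchList, true, filter));
    }
}
```

Filtering at this point touches only the entries of one fetch list rather than the whole CrawlDb, which is the cost argument made in the description. Keeping the check behind a flag leaves the default behavior unchanged for crawls that do not need it.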