[jira] [Commented] (NUTCH-2627) Fetcher to optionally filter URLs

Hudson (JIRA) Fri, 22 Feb 2019 07:45:12 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775296#comment-16775296
 ]


Hudson commented on NUTCH-2627:
-------------------------------

FAILURE: Integrated in Jenkins build Nutch-trunk #3611 (See 
[https://builds.apache.org/job/Nutch-trunk/3611/])
NUTCH-2627 Fetcher to optionally filter URLs - filter and normalize URLs 
(snagel: 
[https://github.com/apache/nutch/commit/546237d4789b2df958752a722053a89c31c24597])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java


> Fetcher to optionally filter URLs
> ---------------------------------
>
>                 Key: NUTCH-2627
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2627
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> When running a large web crawl, a webadmin sometimes requests that crawling 
> of a certain domain stop immediately. The default Nutch workflow applies URL 
> filters only to seeds and outlinks. Applying filters during fetch list 
> generation is expensive with a large CrawlDb (fetch lists are usually much 
> shorter). Allowing the fetcher to optionally filter URLs would make it 
> possible to apply changed filter rules to the next launched fetcher job even 
> if the segment has already been created (especially if multiple segments are 
> generated in one turn).
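
The idea in the description above can be illustrated with a minimal, self-contained sketch. This is not the code from the commit and does not use any Nutch classes: the class name, the `feed` and `excludeHost` helpers, the `filterEnabled` switch, and the example URLs are all hypothetical stand-ins. It only shows the general pattern of re-checking each URL of an already generated fetch list against the current filter rules at the moment it is queued for fetching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class FetchListFilterSketch {

    /**
     * Hypothetical stand-in for a URL filter rule: the returned predicate
     * accepts every URL except those on the given host or its subdomains.
     */
    static Predicate<String> excludeHost(String host) {
        Pattern blocked = Pattern.compile(
            "^https?://([^/]+\\.)?" + Pattern.quote(host) + "(:\\d+)?(/.*)?$");
        return url -> !blocked.matcher(url).matches();
    }

    /**
     * Queues URLs from an existing fetch list. When filtering is enabled,
     * URLs rejected by the current filter are skipped instead of queued;
     * when it is disabled, the fetch list is passed through unchanged.
     */
    static List<String> feed(List<String> fetchList,
                             boolean filterEnabled,
                             Predicate<String> filter) {
        List<String> queued = new ArrayList<>();
        for (String url : fetchList) {
            if (!filterEnabled || filter.test(url)) {
                queued.add(url);
            }
        }
        return queued;
    }

    public static void main(String[] args) {
        // A segment generated before the filter rule was changed.
        List<String> segment = List.of(
            "https://example.org/a",
            "https://blocked.example.com/page",
            "https://example.net/b");

        // The rule was added after segment generation; it still takes effect
        // because it is evaluated when the URLs are fed to the fetcher.
        System.out.println(feed(segment, true, excludeHost("blocked.example.com")));
    }
}
```

Because the check runs over the fetch list rather than the whole CrawlDb, its cost scales with the (much shorter) segment, which is the trade-off the description points to.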



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
