[ 
https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701527#comment-15701527
 ] 

ASF GitHub Bot commented on NUTCH-2335:
---------------------------------------

GitHub user sebastian-nagel opened a pull request:

    https://github.com/apache/nutch/pull/158

    NUTCH-2335 Injector not to filter and normalize existing items/URLs in CrawlDb

    Restore the default behavior from before NUTCH-1712 and make the use of
    URL filters and normalizers configurable via command-line options:
    - `-filterNormalizeAll` : normalize and filter all URLs, including the
      URLs of existing CrawlDb records
    - `-nonormalize` and `-nofilter` : do not normalize or filter,
      respectively, any URLs (newly injected or existing ones)
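
The option semantics described above can be modeled with a small Python sketch. This is only an illustration of the described behavior, not Nutch's actual Injector code: the functions `normalize`, `accept`, `process`, and `inject` are hypothetical, the normalizer and filter rules are made up, and the assumption that `-nonormalize`/`-nofilter` still apply when `-filterNormalizeAll` is set is mine, not something the pull request states.

```python
from typing import Iterable, Optional, Set
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    """Hypothetical normalizer: strip whitespace and drop the URL fragment."""
    return urldefrag(url.strip()).url


def accept(url: str) -> bool:
    """Hypothetical filter: keep only http(s) URLs."""
    return url.startswith(("http://", "https://"))


def process(url: str, do_normalize: bool, do_filter: bool) -> Optional[str]:
    """Normalize and/or filter one URL; return None if it is filtered out."""
    if do_normalize:
        url = normalize(url)
    if do_filter and not accept(url):
        return None
    return url


def inject(
    existing: Iterable[str],
    injected: Iterable[str],
    filter_normalize_all: bool = False,  # models -filterNormalizeAll
    no_normalize: bool = False,          # models -nonormalize
    no_filter: bool = False,             # models -nofilter
) -> Set[str]:
    """Merge injected URLs into the set of existing URLs."""
    do_normalize = not no_normalize
    do_filter = not no_filter
    result: Set[str] = set()

    # Existing records are left untouched unless explicitly requested.
    for url in existing:
        kept: Optional[str] = url
        if filter_normalize_all:
            kept = process(url, do_normalize, do_filter)
        if kept is not None:
            result.add(kept)

    # Newly injected URLs are normalized and filtered by default.
    for url in injected:
        kept = process(url, do_normalize, do_filter)
        if kept is not None:
            result.add(kept)

    return result
```

In this model, the default call leaves existing records as they are, which is the point of the change: the cost of filtering and normalizing grows with the number of injected URLs rather than with the size of the CrawlDb.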


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebastian-nagel/nutch NUTCH-2335

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #158
    
----
commit 779267df83092caaa3bd0b001f6af7ea7eeb1aa7
Author: Sebastian Nagel <[email protected]>
Date:   2016-11-28T09:52:46Z

    NUTCH-2335 Injector not to filter and normalize existing items/URLs in CrawlDb

----


> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
>                 Key: NUTCH-2335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>             Fix For: 1.13
>
>
> With NUTCH-1712, the behavior of the Injector changed for the case where new 
> URLs are added to an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those 
> already in the CrawlDb
> The default should be, as before, not to filter or normalize existing URLs. 
> Filtering and normalizing may take a long time for large CrawlDbs and/or 
> complex URL filters. If the URL filter or normalizer rules have not changed, 
> there is no need to apply them anew every time new URLs are added. Of course, 
> injected URLs should still be filtered and normalized by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
