[ https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958788#comment-15958788 ]
Hudson commented on NUTCH-2335:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch-trunk #3421 (See
[https://builds.apache.org/job/Nutch-trunk/3421/])
NUTCH-2335 Injector not to filter and normalize existing items/URLs in (snagel:
[https://github.com/apache/nutch/commit/5945db20de21c62795315c095ccf9ff4c61f3ebe])
* (edit) src/java/org/apache/nutch/crawl/Injector.java
> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
> Key: NUTCH-2335
> URL: https://issues.apache.org/jira/browse/NUTCH-2335
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb, injector
> Affects Versions: 1.12
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Since NUTCH-1712, the behavior of the Injector has changed when new URLs are
> added to an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those
> already in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and
> normalizing may take a long time for large CrawlDbs and/or complex URL
> filters. If the URL filter or normalizer rules have not changed, there is no
> need to apply them anew every time new URLs are added. Of course, injected
> URLs should still be filtered and normalized by default.
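
The requested default can be illustrated with a minimal, self-contained Java
sketch. This is not the code from Injector.java: the class InjectSketch, the
methods normalizeAndFilter and inject, and the flag filterNormalizeAll are
hypothetical names chosen for illustration, and the toy rules (strip the
fragment, accept only http/https) merely stand in for whatever URL filters
and normalizers are configured.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class InjectSketch {

    /**
     * Hypothetical stand-in for a normalizer + filter chain.
     * Returns the cleaned URL, or null if the URL is rejected.
     */
    static String normalizeAndFilter(String url) {
        String u = url.trim();
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash); // toy normalization: drop the fragment
        }
        // toy filter: accept only http and https URLs
        return (u.startsWith("http://") || u.startsWith("https://")) ? u : null;
    }

    /**
     * Merges injected URLs into the existing ones.
     * Injected URLs are always normalized and filtered; existing URLs are
     * passed through untouched unless filterNormalizeAll is set.
     */
    static Set<String> inject(Collection<String> existing,
                              Collection<String> injected,
                              boolean filterNormalizeAll) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : existing) {
            String u = filterNormalizeAll ? normalizeAndFilter(url) : url;
            if (u != null) {
                result.add(u);
            }
        }
        for (String url : injected) {
            String u = normalizeAndFilter(url);
            if (u != null) {
                result.add(u);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("http://a.example/#frag", "ftp://old.example/");
        List<String> injected = Arrays.asList(" http://b.example/ ", "ftp://new.example/");
        // Default: existing entries are kept as they are.
        System.out.println(inject(existing, injected, false));
        // Opt-in: rules are applied to existing entries as well.
        System.out.println(inject(existing, injected, true));
    }
}
```

With the flag off, the cost of the filter and normalizer chain grows with the
number of injected URLs rather than with the size of the whole CrawlDb, which
is the point the description makes about large CrawlDbs.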