[ 
https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962817#comment-15962817
 ] 

Sebastian Nagel commented on NUTCH-2335:
----------------------------------------

It's only disabled by default for *existing* CrawlDb entries. New URLs (seeds)
are still filtered and normalized by default, as has always been the case.
Items already in the CrawlDb (usually already filtered when they were added
as seeds or outlinks) are not filtered or normalized again. That saves a lot
of time for large CrawlDbs. This was verified simply by comparing the times
needed to inject 3 URLs into an existing 20 MB CrawlDb with and without the
option -filterNormalizeAll (in both runs without -nofilter and -nonormalize).
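
To make the default concrete, here is a minimal sketch of the logic in
Python. It is not the actual Injector code. The names inject, pipeline and
filter_normalize_all are hypothetical, pipeline stands in for the configured
URL normalizers and filters, and -nofilter / -nonormalize are not modeled.
Keeping the existing entry when a seed is already in the CrawlDb is an
assumption of the sketch.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical stand-in for Nutch's normalizers + filters: returns the
# normalized URL, or None if the URL is rejected by a filter.
Pipeline = Callable[[str], Optional[str]]


def inject(crawldb: Dict[str, str],
           seeds: Iterable[str],
           pipeline: Pipeline,
           filter_normalize_all: bool = False) -> Dict[str, str]:
    """Merge seeds into a copy of crawldb (URL -> status)."""
    merged: Dict[str, str] = {}

    # Existing entries pass through untouched unless the caller asks for
    # everything to be filtered and normalized again.
    for url, status in crawldb.items():
        if filter_normalize_all:
            normalized = pipeline(url)
            if normalized is None:
                continue  # rejected by a filter, drop the entry
            url = normalized
        merged[url] = status

    # Injected URLs always go through the pipeline.
    for seed in seeds:
        normalized = pipeline(seed)
        if normalized is not None:
            # Assumption of this sketch: an existing entry is kept as is.
            merged.setdefault(normalized, "injected")

    return merged
```

With the default, the cost of the pipeline grows with the number of seeds
only. With filter_normalize_all=True it grows with the size of the whole
CrawlDb, which is where the time difference in the comparison comes from.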

> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
>                 Key: NUTCH-2335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.14
>
>
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are 
> added to an existing CrawlDb:
> - before only injected URLs were filtered and normalized
> - now filters and normalizers are applied to all URLs including those already 
> in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and
> normalizing may take a long time for large CrawlDbs and/or complex URL
> filters. If URL filter or normalizer rules have not changed, there is no need
> to apply them anew every time new URLs are added. Of course, injected URLs
> should still be filtered and normalized by default.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
