[ 
https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964224#comment-15964224
 ] 

Markus Jelsma commented on NUTCH-2335:
--------------------------------------

Yes, it still filters/normalizes. Although it is not obvious in the code, the 
mapper I just grabbed a stack trace from actually does perform filtering. I 
also confirmed that the running build contains this patch.

{code}
"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f9fe8142800 nid=0x26fd in Object.wait() [0x00007f9fcac09000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d65b6b28> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00000000d65b6b28> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f9fe813e000 nid=0x26fb in Object.wait() [0x00007f9fcad0a000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d65b6b58> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        - locked <0x00000000d65b6b58> (a java.lang.ref.Reference$Lock)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0x00007f9fe8013800 nid=0x26c6 runnable [0x00007f9fefe9b000]
   java.lang.Thread.State: RUNNABLE
        at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
        at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
        at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
        at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
        at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:4260)
        at java.util.regex.Pattern$Curly.match(Pattern.java:4234)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
        at java.util.regex.Pattern$Start.match(Pattern.java:3461)
        at java.util.regex.Matcher.search(Matcher.java:1248)
        at java.util.regex.Matcher.find(Matcher.java:637)
        at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:107)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:190)
        at org.apache.nutch.net.URLFilters.filter(URLFilters.java:39)
        at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:133)
        at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:228)
        at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{code}

> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
>                 Key: NUTCH-2335
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2335
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.12
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.14
>
>
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are 
> added to an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those already 
> in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and 
> normalizing may take a long time for large CrawlDbs and/or complex URL filters. 
> If URL filter or normalizer rules are not changed, there is no need to apply 
> them anew every time new URLs are added. Of course, injected URLs should be 
> filtered and normalized by default.
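
The behavior requested in the description above could be sketched roughly as 
follows. This is only an illustration, not the Nutch Injector code: the class 
InjectFilterSketch, the method process, the flag applyToExisting and the toy 
filter/normalizer chain TOY_CHAIN are all hypothetical names made up for this 
example.

```java
import java.util.function.UnaryOperator;

public class InjectFilterSketch {

    // Hypothetical stand-in for a filter/normalizer chain: returns the
    // (possibly rewritten) URL, or null if the URL is rejected.
    // Toy rules: accept only http(s) URLs and strip a trailing #fragment.
    static final UnaryOperator<String> TOY_CHAIN = url -> {
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            return null;
        }
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    };

    // Injected URLs always go through the chain. URLs already in the
    // CrawlDb go through it only if applyToExisting (hypothetical flag)
    // is set; otherwise they are passed on unchanged.
    static String process(String url, boolean injected,
                          boolean applyToExisting,
                          UnaryOperator<String> chain) {
        if (injected || applyToExisting) {
            return chain.apply(url);
        }
        return url;
    }

    public static void main(String[] args) {
        // Existing record, flag off: untouched even though the chain would reject it.
        System.out.println(process("ftp://example.org/a", false, false, TOY_CHAIN));
        // Injected URL: filtered (rejected, so null).
        System.out.println(process("ftp://example.org/a", true, false, TOY_CHAIN));
        // Existing record, flag on: normalized like an injected URL.
        System.out.println(process("http://example.org/a#frag", false, true, TOY_CHAIN));
    }
}
```

With the flag off, the cost of the chain is paid only for the newly injected 
URLs, which is the point of the proposed default.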



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
