[
https://issues.apache.org/jira/browse/NUTCH-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964224#comment-15964224
]
Markus Jelsma commented on NUTCH-2335:
--------------------------------------
Yes, it still filters/normalizes. Although it is not obvious in the code, the
mapper I just grabbed a stack trace from actually does perform filtering. I
also confirmed that the running build contains this patch.
{code}
"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f9fe8142800 nid=0x26fd in Object.wait() [0x00007f9fcac09000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000d65b6b28> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
- locked <0x00000000d65b6b28> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f9fe813e000 nid=0x26fb in Object.wait() [0x00007f9fcad0a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000d65b6b58> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x00000000d65b6b58> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
"main" #1 prio=5 os_prio=0 tid=0x00007f9fe8013800 nid=0x26c6 runnable [0x00007f9fefe9b000]
java.lang.Thread.State: RUNNABLE
at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4260)
at java.util.regex.Pattern$Curly.match(Pattern.java:4234)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
at java.util.regex.Pattern$Start.match(Pattern.java:3461)
at java.util.regex.Matcher.search(Matcher.java:1248)
at java.util.regex.Matcher.find(Matcher.java:637)
at org.apache.nutch.urlfilter.regex.RegexURLFilter$Rule.match(RegexURLFilter.java:107)
at org.apache.nutch.urlfilter.api.RegexURLFilterBase.filter(RegexURLFilterBase.java:190)
at org.apache.nutch.net.URLFilters.filter(URLFilters.java:39)
at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:133)
at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:228)
at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{code}
> Injector not to filter and normalize existing URLs in CrawlDb
> -------------------------------------------------------------
>
> Key: NUTCH-2335
> URL: https://issues.apache.org/jira/browse/NUTCH-2335
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb, injector
> Affects Versions: 1.12
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> With NUTCH-1712 the behavior of the Injector has changed in case new URLs are
> added to an existing CrawlDb:
> - before, only injected URLs were filtered and normalized
> - now, filters and normalizers are applied to all URLs, including those already
> in the CrawlDb
> The default should be, as before, not to filter existing URLs. Filtering and
> normalizing may take a long time for large CrawlDbs and/or complex URL filters. If
> URL filter or normalizer rules are not changed there is no need to apply them
> anew every time new URLs are added. Of course, injected URLs should be
> filtered and normalized by default.
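
To make the requested default concrete, here is a minimal, self-contained sketch of the intended behavior. It is not the actual Nutch Injector code: the class name InjectFilterSketch, the method inject, and the flag filterNormalizeAll are hypothetical, and the normalizer and filter are toy stand-ins for the real URL normalizer and filter plugins. Injected URLs are always normalized and filtered, while existing entries pass through untouched unless the flag is set.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the proposed default; not the actual Nutch code.
public class InjectFilterSketch {

    /**
     * Merges injected URLs into an existing URL set.
     * Injected URLs are always normalized and filtered. Existing URLs are
     * only re-processed when filterNormalizeAll (an assumed flag name) is true.
     */
    public static Set<String> inject(List<String> existing,
                                     List<String> injected,
                                     UnaryOperator<String> normalizer,
                                     Predicate<String> filter,
                                     boolean filterNormalizeAll) {
        Set<String> result = new LinkedHashSet<>();
        for (String url : existing) {
            if (filterNormalizeAll) {
                String normalized = normalizer.apply(url);
                if (filter.test(normalized)) {
                    result.add(normalized);
                }
            } else {
                // Default: keep existing entries as they are, no filter cost.
                result.add(url);
            }
        }
        for (String url : injected) {
            // New URLs always go through the normalizer and the filter.
            String normalized = normalizer.apply(url);
            if (filter.test(normalized)) {
                result.add(normalized);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy normalizer and filter, used only for illustration.
        UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
        Predicate<String> rejectSkip = u -> !u.contains("skip");

        List<String> existing = List.of("http://example.org/A", "http://example.org/skip");
        List<String> injected = List.of("HTTP://example.org/B", "http://example.org/skip2");

        // Existing entries are kept verbatim; only injected URLs are processed.
        System.out.println(inject(existing, injected, lowerCase, rejectSkip, false));
        // Everything is normalized and filtered, as after NUTCH-1712.
        System.out.println(inject(existing, injected, lowerCase, rejectSkip, true));
    }
}
```

With the flag off, the cost of the filter (for example, an expensive regular expression like the one the main thread is busy with in the stack trace) is paid only for the injected URLs rather than for every CrawlDb entry.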