Sebastian Nagel created NUTCH-2949:
--------------------------------------
Summary: Tasks of a multi-threaded map runner may fail because of
slow creation of URL stream handlers
Key: NUTCH-2949
URL: https://issues.apache.org/jira/browse/NUTCH-2949
Project: Nutch
Issue Type: Bug
Components: net, plugin, protocol
Affects Versions: 1.19
Reporter: Sebastian Nagel
Fix For: 1.19
While running a custom Nutch job ([code
here|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]),
many but not all task failed exceeding the the Hadoop task time-out
(`mapreduce.task.timeout`) without generating any "heartbeat" (output, counter
increments, log messages). Hadoop logs the stacks of all threads of the timed
out task. That's the base for the excerpts below.
The job runs a MultithreadedMapper - most of the mapper threads (48 in total)
are waiting for the URLStreamHandler in order to construct a java.net.URL
object:
{noformat}
"Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s
tid=0x00007f3eb5b0f800 nid=0x8e651 waiting for monitor entry
[0x00007f3e84ef9000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
- waiting to lock <0x00000006a1bc0630> (a java.lang.String)
at
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
at
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
at java.net.URL.getURLStreamHandler([email protected]/URL.java:1432)
at java.net.URL.<init>([email protected]/URL.java:651)
at java.net.URL.<init>([email protected]/URL.java:541)
at java.net.URL.<init>([email protected]/URL.java:488)
at
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
at
org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
{noformat}
Only a single mapper thread is active:
{noformat}
"Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s
tid=0x00007f3eb5b42800 nid=0x8e661 in Object.wait() [0x00007f3e842ec000]
java.lang.Thread.State: RUNNABLE
at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0([email protected]/Native
Method)
at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance([email protected]/NativeConstructorAccessorImpl.java:62)
at
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance([email protected]/DelegatingConstructorAccessorImpl.java:45)
at
java.lang.reflect.Constructor.newInstance([email protected]/Constructor.java:490)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
- locked <0x00000006a1bc0630> (a java.lang.String)
at
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
at
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
at java.net.URL.getURLStreamHandler([email protected]/URL.java:1432)
at java.net.URL.<init>([email protected]/URL.java:651)
at java.net.URL.<init>([email protected]/URL.java:541)
at java.net.URL.<init>([email protected]/URL.java:488)
at
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
at
org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
{noformat}
The stack of one thread shows a nested lock:
{noformat}
"Thread-26" #42 prio=5 os_prio=0 cpu=24.93ms elapsed=647.07s
tid=0x00007f3eb5b49800 nid=0x8e664 waiting for monitor entry
[0x00007f3e83fe7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
- waiting to lock <0x00000006a1bc0630> (a java.lang.String)
at
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
at
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
at java.net.URL.getURLStreamHandler([email protected]/URL.java:1432)
at java.net.URL.<init>([email protected]/URL.java:651)
at java.net.URL.<init>([email protected]/URL.java:541)
at java.net.URL.<init>([email protected]/URL.java:488)
at
javax.crypto.JceSecurity.<clinit>([email protected]/JceSecurity.java:239)
at javax.crypto.Cipher.getInstance([email protected]/Cipher.java:540)
at
sun.security.ssl.JsseJce.getCipher([email protected]/JsseJce.java:190)
at
sun.security.ssl.SSLCipher.isTransformationAvailable([email protected]/SSLCipher.java:509)
at
sun.security.ssl.SSLCipher.<init>([email protected]/SSLCipher.java:498)
at
sun.security.ssl.SSLCipher.<clinit>([email protected]/SSLCipher.java:81)
at
sun.security.ssl.CipherSuite.<clinit>([email protected]/CipherSuite.java:65)
at
sun.security.ssl.SSLContextImpl.getApplicableSupportedCipherSuites([email protected]/SSLContextImpl.java:348)
at
sun.security.ssl.SSLContextImpl$AbstractTLSContext.<clinit>([email protected]/SSLContextImpl.java:580)
at java.lang.Class.forName0([email protected]/Native Method)
at java.lang.Class.forName([email protected]/Class.java:315)
at
java.security.Provider$Service.getImplClass([email protected]/Provider.java:1918)
at
java.security.Provider$Service.newInstance([email protected]/Provider.java:1894)
at
sun.security.jca.GetInstance.getInstance([email protected]/GetInstance.java:236)
at
sun.security.jca.GetInstance.getInstance([email protected]/GetInstance.java:164)
at
javax.net.ssl.SSLContext.getInstance([email protected]/SSLContext.java:168)
at org.apache.nutch.protocol.okhttp.OkHttp.<clinit>(OkHttp.java:97)
at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0([email protected]/Native
Method)
at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance([email protected]/NativeConstructorAccessorImpl.java:62)
at
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance([email protected]/DelegatingConstructorAccessorImpl.java:45)
at
java.lang.reflect.Constructor.newInstance([email protected]/Constructor.java:490)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
- locked <0x00000006a1bc0410> (a java.lang.String)
at
org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
at
org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
at java.net.URL.getURLStreamHandler([email protected]/URL.java:1432)
at java.net.URL.<init>([email protected]/URL.java:651)
at java.net.URL.<init>([email protected]/URL.java:541)
at java.net.URL.<init>([email protected]/URL.java:488)
at
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
at
org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
at
org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
at
org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
{noformat}
It isn't a dead-lock because the lock is on two different String objects in
different lines of code. However, a recursive barrier may make the performance
impact of locks much worse. The stack also shows that Nutch's
URLStreamHandlerFactory is also called from Java-internal methods. The
cryptographic classes in the stack could also give a hint what's going wrong in
NUTCH-2936.
I'm not yet clear what the solution could be:
-
[Extension.getExtensionInstance()|https://github.com/apache/nutch/blob/568993b908702b9631a705c8122ba99af1b72f04/src/java/org/apache/nutch/plugin/Extension.java#L154]
with nested synchronization is for sure not suitable to be called frequently
- maybe caching the URL stream handlers solves the problem
- but we should keep also NUTCH-2936 on the radar when looking for solutions
--
This message was sent by Atlassian Jira
(v8.20.7#820007)