[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554531#comment-17554531 ]
Sebastian Nagel commented on NUTCH-2936:
----------------------------------------

After debugging this: the call by the Hadoop MR Job to initialize the KeyGenerator leads twice recursively into Nutch's URLStreamHandlerFactory - first for the "http" protocol to create a [NULL_URL (http://null.oracle.com/)|https://github.com/openjdk/jdk/blob/0530f4e517be5d5b3ff10be8a0764e564f068c06/src/java.base/share/classes/javax/crypto/JceSecurity.java.template#L246], second for the "jar" protocol to load the protocol-okhttp.jar.

See the debug log output and the stack trace (some lines stripped):
{noformat}
2022-06-14 16:56:59,176 DEBUG plugin.URLStreamHandlerFactory: Registered URLStreamHandlerFactory with the JVM.
2022-06-14 16:56:59,994 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,994 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,995 DEBUG plugin.PluginRepository: Suitable protocolName attribute located: http
2022-06-14 16:57:00,007 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,007 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,008 DEBUG plugin.PluginRepository: No suitable protocol extensions registered for protocol: jar
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Located extension instance class: org.apache.nutch.protocol.okhttp.OkHttp
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Suitable protocol extension found that did not declare a handler
{noformat}
{noformat}
    at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:583)
    at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
    at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
    at java.base/java.net.URL.<init>(URL.java:451)
    at java.base/jdk.internal.loader.URLClassPath$JarLoader.<init>(URLClassPath.java:720)
    at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:494)
    at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:477)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:476)
    at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:445)
    at java.base/jdk.internal.loader.URLClassPath.getResource(URLClassPath.java:314)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:455)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:452)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:451)
    at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:71)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.apache.nutch.plugin.PluginRepository.getCachedClass(PluginRepository.java:349)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:165)
    at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:601)
    at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
    at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
    at java.base/java.net.URL.<init>(URL.java:651)
    at java.base/java.net.URL.<init>(URL.java:541)
    at java.base/java.net.URL.<init>(URL.java:488)
    at java.base/javax.crypto.JceSecurity.<clinit>(JceSecurity.java:239)
    at java.base/javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:363)
    at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:176)
    at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
    ...
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1568)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1589)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:436)
    at org.apache.nutch.crawl.Injector.run(Injector.java:569)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
    at org.apache.nutch.crawl.Injector.main(Injector.java:533)
    ...
{noformat}

I do not understand why the initialization of the KeyGenerator fails only in this combination (distributed mode and using protocol-okhttp). Nevertheless, we should never delegate the standard URLStreamHandlers implemented by the JVM to handlers that require the Nutch plugin system, with its complexity and its plugin-specific class loaders. This may break things unexpectedly, especially if stream handlers are used to open connections. See also HADOOP-14598 for a similar issue.


> Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2936
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2936
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
>     at java.base/java.security.AccessController.doPrivileged(Native Method)
>     at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
>     at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
>     at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
>     at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
>     ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429)
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication
> job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
> that flags redirects pointing to the same target URL. But I'll try to reproduce
> it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in
> distributed mode. Tasks are launched differently, not as NutchJob or
> NutchTool.
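
To illustrate the point made in the comment (standard protocols should not be routed through the plugin system), here is a minimal sketch of a factory that declines the protocols the JVM handles itself. It is not the Nutch implementation: the class name GuardedStreamHandlerFactory, the pluginHandlers map and the exact contents of JVM_PROTOCOLS are assumptions made for this example. It relies only on documented java.net.URL behaviour: if the registered factory returns null for a protocol, the JVM continues its own handler lookup and ends up at the built-in handler.

```java
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch; not Nutch's actual URLStreamHandlerFactory. */
class GuardedStreamHandlerFactory implements URLStreamHandlerFactory {

  // Assumption: protocols treated as "standard" and left to the JVM.
  // The real set depends on the JDK in use.
  private static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar", "jrt", "ftp", "mailto");

  // Hypothetical stand-in for the plugin lookup: protocol -> lazy handler.
  private final Map<String, Supplier<URLStreamHandler>> pluginHandlers;

  GuardedStreamHandlerFactory(
      Map<String, Supplier<URLStreamHandler>> pluginHandlers) {
    this.pluginHandlers = pluginHandlers;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    String p = protocol.toLowerCase(Locale.ROOT);
    if (JVM_PROTOCOLS.contains(p)) {
      // Returning null makes java.net.URL fall back to the JVM's own
      // handler, so no plugin class loader is touched for e.g. "jar".
      return null;
    }
    Supplier<URLStreamHandler> supplier = pluginHandlers.get(p);
    // Unknown protocols also return null and are left to the JVM.
    return supplier == null ? null : supplier.get();
  }
}
```

With such a guard, the "http" and "jar" lookups seen in the stack trace above would return immediately instead of recursing into plugin class loading. Whether that alone avoids the KeyGenerator failure is not verified here.
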
--
This message was sent by Atlassian Jira
(v8.20.7#820007)