[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554859#comment-17554859 ]
ASF GitHub Bot commented on NUTCH-2936:
---------------------------------------
lewismc commented on PR #733:
URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313
This is exciting!!! Excellent debugging 👍 ... you got further than me.
I can't get around to testing it until next week at the earliest.
Thinking back, I did observe revisits (recursive access) to
URLStreamHandlerFactory but didn't pursue that line of inquiry at that point in
time.
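For anyone following along, here is a minimal, purely illustrative sketch of how that kind of recursive access can happen (this is not Nutch code; the class name and URLs are made up): once a factory is installed via `URL.setURLStreamHandlerFactory()`, the JVM consults it for every URL constructed afterwards, so any URL created from inside `createURLStreamHandler()` re-enters the factory before the first call has returned.
```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Hypothetical illustration only -- not the Nutch factory.
public class RecursiveFactoryIllustration implements URLStreamHandlerFactory {

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    System.out.println("createURLStreamHandler(" + protocol + ")");
    if ("http".equals(protocol)) {
      try {
        // Constructing a URL here re-enters this factory for "https"
        // before the outer call has returned -- the kind of revisit
        // described above.
        new URL("https://example.invalid/");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
    // Returning null tells the JVM to fall back to its built-in handlers.
    return null;
  }

  public static void main(String[] args) throws Exception {
    // May be called at most once per JVM; every URL lookup for a protocol
    // that is not yet cached goes through this factory first.
    URL.setURLStreamHandlerFactory(new RecursiveFactoryIllustration());
    new URL("http://example.invalid/"); // triggers createURLStreamHandler("http")
  }
}
```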
To get a bit more context I did review
[HADOOP-14598-005.patch](https://issues.apache.org/jira/secure/attachment/12880380/HADOOP-14598-005.patch)
and the current class it affects. Reading the code, things make more sense, but
admittedly I won't have the full context until I debug this myself.
I took a look at [hadoop-hdfs
TestUrlStreamHandler.java](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestUrlStreamHandler.java)
as well, which I really like the look of. To build out some more confidence in
this aspect of the codebase, we could create some tests for the [nutch
URLStreamHandlerFactory.java](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java).
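To make that concrete, something along these lines could be a starting point. This is only a rough sketch, assuming the factory exposes the `getInstance()` singleton accessor and the standard `createURLStreamHandler(String)` contract; the expected return values are my assumptions, not verified behaviour.
```java
package org.apache.nutch.plugin;

import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertSame;

import org.junit.Test;

// Rough test sketch, not a finished patch.
public class TestURLStreamHandlerFactory {

  @Test
  public void testFactoryIsSingleton() {
    // getInstance() should always hand back the same factory instance.
    URLStreamHandlerFactory first = URLStreamHandlerFactory.getInstance();
    URLStreamHandlerFactory second = URLStreamHandlerFactory.getInstance();
    assertNotNull(first);
    assertSame(first, second);
  }

  @Test
  public void testUnknownProtocolFallsBackToJvm() {
    // Per the java.net.URLStreamHandlerFactory contract, returning null
    // lets the JVM fall back to its built-in handler lookup. A protocol
    // that no plugin declares should therefore yield null (assumption).
    assertNull(URLStreamHandlerFactory.getInstance()
        .createURLStreamHandler("no-such-protocol"));
  }
}
```
Whether those expectations hold, especially the null fallback once plugins register their own handlers, is exactly what I'd want the debugging to confirm.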
> Early registration of URL stream handlers provided by plugins may fail Hadoop
> jobs running in distributed mode if protocol-okhttp is used
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
> Issue Type: Bug
> Components: plugin, protocol
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Lewis John McGibbney
> Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
>         at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
>         at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
>         at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
>         ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429)
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job where this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] which flags redirects pointing to the same target URL. But I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in
> distributed mode. Tasks are launched differently, not as NutchJob or
> NutchTool.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)