[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476734#comment-17476734
 ] 

ASF GitHub Bot commented on NUTCH-2936:
---------------------------------------

lewismc opened a new pull request #726:
URL: https://github.com/apache/nutch/pull/726


   I ended up producing this PR as a result of investigating NUTCH-2936. This 
PR does not fix NUTCH-2936.
   The problem is that the 
[trustAllSslSocketFactory](https://github.com/apache/nutch/blob/master/src/plugin/protocol-okhttp/src/java/org/apache/nutch/protocol/okhttp/OkHttp.java#L129)
 variable is `null` when passed into 
`okhttp3.OkHttpClient.Builder.sslSocketFactory`.
   I have not yet tried running protocol-okhttp in distributed mode to 
verify this. I can try that early next week. It might be worthwhile for us 
to revert NUTCH-2429.
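
   The failure mode described above can be sketched with JDK classes only. 
This is a minimal illustration, not the code in `OkHttp.java`: the class 
`TrustAllFactorySketch` and the helpers `createTrustAllFactory` and 
`requireFactory` are hypothetical names, and the assumption is that the 
factory should be built and checked for `null` before it is handed to the 
client builder, so that a missing factory fails with a clear message.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import java.util.Objects;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch: names are illustrative, not taken from Nutch.
public class TrustAllFactorySketch {

  // Trust manager that accepts every certificate chain (insecure by design).
  static final X509TrustManager TRUST_ALL = new X509TrustManager() {
    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) {
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) {
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
      return new X509Certificate[0];
    }
  };

  // Builds a socket factory from an SSLContext initialized with TRUST_ALL.
  static SSLSocketFactory createTrustAllFactory()
      throws GeneralSecurityException {
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null, new TrustManager[] { TRUST_ALL }, null);
    return context.getSocketFactory();
  }

  // Fails fast with a descriptive message instead of passing null onward.
  static SSLSocketFactory requireFactory(SSLSocketFactory factory) {
    return Objects.requireNonNull(factory,
        "trust-all SSLSocketFactory was not initialized");
  }

  public static void main(String[] args) throws Exception {
    SSLSocketFactory factory = requireFactory(createTrustAllFactory());
    System.out.println("factory ready: " + factory.getClass().getName());
  }
}
```

   The checked factory and `TRUST_ALL` would then be the two arguments 
given to the client builder; whether that matches how the plugin is 
initialized in distributed mode is exactly what still needs verifying.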


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2936
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2936
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Blocker
>             Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
>         at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
>         at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
>         at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
>         ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication 
> job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
>  that flags redirects pointing to the same target URL. I'll try to 
> reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.
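
The stack trace in the issue shows the job submitter failing inside 
`KeyGenerator.getInstance` for `HmacSHA1`. A small diagnostic like the 
following could be run in the same JVM, before and after the URL stream 
handlers are registered, to see whether the algorithm lookup itself is 
affected. It is a sketch only: `HmacKeyGeneratorCheck` and 
`isKeyGeneratorAvailable` are hypothetical names, and the code does not 
reproduce Nutch's handler registration.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.security.Security;
import javax.crypto.KeyGenerator;

// Hypothetical diagnostic: not part of Nutch or Hadoop.
public class HmacKeyGeneratorCheck {

  // Returns true if some installed provider offers a KeyGenerator
  // for the given algorithm name.
  static boolean isKeyGeneratorAvailable(String algorithm) {
    try {
      KeyGenerator.getInstance(algorithm);
      return true;
    } catch (NoSuchAlgorithmException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // List the installed security providers to help spot a missing one.
    for (Provider provider : Security.getProviders()) {
      System.out.println("provider: " + provider.getName());
    }
    System.out.println("HmacSHA1 KeyGenerator available: "
        + isKeyGeneratorAvailable("HmacSHA1"));
  }
}
```

If the check succeeds in a plain JVM but fails once the handlers are 
registered, that would support the observation that the early registration 
is involved; it would not by itself explain why.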



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
