[
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891423#comment-17891423
]
Hiran Chaudhuri edited comment on NUTCH-3081 at 10/21/24 7:08 AM:
------------------------------------------------------------------
Just to make the problem a bit clearer (and to emphasize that the warning about
the deprecated constructor is not the culprit), I created three examples. Here
is some Java code that creates a URL with the deprecated constructor:
{code:java}
import java.net.URL;

public class Test1 {
    public static void main(String[] args) throws Exception {
        new URL("http://host/path");
        System.out.println("Success!");
    }
}
{code}
When I execute it, it just prints the success message. Let's change the URL and
see what happens:
{code:java}
import java.net.URL;

public class Test2 {
    public static void main(String[] args) throws Exception {
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
{code}
In this case we get
{{Exception in thread "main" java.net.MalformedURLException: unknown protocol: foo}}
Obviously the JVM does not know how to handle the protocol {{foo}}. Looking
at the
[documentation|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URL.html#%3Cinit%3E(java.lang.String,java.lang.String,int,java.lang.String)]
I can see
_Protocol handlers for the following protocols are guaranteed to exist on the
search path:_
* _{{http}}_
* _{{https}}_
* _{{file}}_
* _{{jar}}_
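The list can be checked with a small probe. This is only a sketch: the class name {{ProtocolProbe}} and the method {{probe}} are made up for illustration, and the {{jar}} scheme is left out because jar URLs use a nested syntax that does not fit the simple {{scheme://host/path}} form. It uses the non-deprecated {{URI.toURL()}} route, which should fail in the same way on an unknown protocol, assuming a plain JVM with no extra protocol handlers installed.

```java
import java.net.MalformedURLException;
import java.net.URI;

class ProtocolProbe {
    /** Returns null if a URL could be built, otherwise the exception message. */
    static String probe(String spec) {
        try {
            // Non-deprecated route: build a URI first, then convert it to a URL.
            URI.create(spec).toURL();
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] specs = {
            "http://host/path",
            "https://host/path",
            "file://host/path",
            "foo://host/path",
            "smb://host/path"
        };
        for (String spec : specs) {
            String error = probe(spec);
            System.out.println(spec + " -> " + (error == null ? "ok" : error));
        }
    }
}
```

If the JVM has no handler for {{foo}} or {{smb}}, both should be reported as unknown protocols regardless of which constructor or factory method is used.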
So the foo protocol, just like the smb protocol, is unknown to the JVM by
default. Let's introduce the missing protocol in the next example:
{code:java}
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class Test3 {
    public static void main(String[] args) throws Exception {
        // Register a factory that the JVM consults for protocol handlers.
        URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory() {
            @Override
            public URLStreamHandler createURLStreamHandler(String protocol) {
                System.out.println("createURLStreamHandler - " + protocol);
                if ("foo".equals(protocol)) {
                    return new URLStreamHandler() {
                        @Override
                        protected URLConnection openConnection(URL url) throws IOException {
                            throw new UnsupportedOperationException("Not supported yet.");
                        }
                    };
                }
                // Returning null lets the JVM fall back to its built-in handlers.
                return null;
            }
        });
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
{code}
Here the code prints the success message again. This time the URL could be
constructed because a URLStreamHandler for the protocol was found. We did not
open the connection; if we did, we would receive an
UnsupportedOperationException, whereas a real implementation would presumably
return a usable connection. This is what the protocol plugins do: they extend
the JVM's capability to handle additional URL schemes. That is why I can tell
directly that the plugins were not loaded: otherwise the smb protocol would
have been known.
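To illustrate the remark about opening the connection, here is a minimal sketch. The names {{StubHandlerDemo}}, {{StubHandler}} and {{tryOpen}} are made up for illustration. Unlike Test3, it hands the handler directly to the (also deprecated) three-argument URL constructor instead of registering a factory, since a factory can only be set once per JVM. This is an assumption made to keep the sketch repeatable, not how the Nutch plugins register their handlers.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class StubHandlerDemo {
    /** Same stub as in Test3: the URL can be built, but not opened. */
    static class StubHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL url) throws IOException {
            throw new UnsupportedOperationException("Not supported yet.");
        }
    }

    /** Builds the URL with the stub handler and reports what opening it does. */
    @SuppressWarnings("deprecation")
    static String tryOpen(String spec) {
        try {
            // Passing the handler explicitly avoids touching JVM-wide state.
            URL url = new URL((URL) null, spec, new StubHandler());
            url.openConnection();
            return "connection opened";
        } catch (UnsupportedOperationException e) {
            return "UnsupportedOperationException: " + e.getMessage();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryOpen("foo://host/path"));
    }
}
```

Constructing the URL succeeds because a handler is available; only the attempt to open the connection reaches the stub and fails.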
> Crawlcomplete command does not load plugins
> -------------------------------------------
>
> Key: NUTCH-3081
> URL: https://issues.apache.org/jira/browse/NUTCH-3081
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode,
> sharing)
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb
> -mode host -outputDir crawl/dump/}}
>
> But to my surprise I do not get a dump of the URLs including the fetch
> status, or some statistics with counters but errors related to the unknown
> smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents/.htaccess: unknown protocol: smb}}
> The nutch configuration is correct, all the other tools load plugins and log
> doing so to stdout. With crawlcomplete there is no such output, and the smb
> protocol is unknown. It looks like plugin configuration is completely ignored.