[ https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891423#comment-17891423 ]
Hiran Chaudhuri edited comment on NUTCH-3081 at 10/21/24 7:07 AM:
------------------------------------------------------------------

Just to make the problem a bit clearer (and to emphasize that the warning about the deprecated constructor is not the culprit), I created three examples.

Here is some Java code that creates a URL with the deprecated constructor:
{code:java}
import java.net.URL;

public class Test1 {
    public static void main(String[] args) throws Exception {
        // Deprecated constructor, but http is a built-in protocol, so this succeeds.
        new URL("http://host/path");
        System.out.println("Success!");
    }
}
{code}
When I execute it, it just prints the success message. Let's change the URL and see what happens:
{code:java}
import java.net.URL;

public class Test2 {
    public static void main(String[] args) throws Exception {
        // Same constructor, but with a protocol the JVM has no handler for.
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
{code}
In this case we get
{{Exception in thread "main" java.net.MalformedURLException: unknown protocol: foo}}

Obviously the JVM does not know how to handle the protocol {{foo}}. Looking at the [documentation|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URL.html#%3Cinit%3E(java.lang.String,java.lang.String,int,java.lang.String)] I can see

_Protocol handlers for the following protocols are guaranteed to exist on the search path:_
 * _{{http}}_
 * _{{https}}_
 * _{{file}}_
 * _{{jar}}_

So the {{foo}} protocol, just like the {{smb}} protocol, is unknown.
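
As a quick cross-check, several schemes can be probed in one run. The following is only a sketch and not part of Nutch: the class name {{ProtocolProbe}} and the method {{canConstruct}} are made up for this illustration. It deliberately goes through {{URI.toURL()}} instead of the deprecated constructor, on the assumption that a plain JDK without extra handlers is used, to show that the protocol lookup fails in the same way on the non-deprecated path.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or the JDK.
final class ProtocolProbe {

    /**
     * Returns true if a java.net.URL can be built from the given string.
     * Any MalformedURLException (for example "unknown protocol: foo")
     * is reported and treated as "cannot construct".
     */
    static boolean canConstruct(String spec) {
        try {
            // Non-deprecated route to a URL; it still needs a protocol handler.
            URI.create(spec).toURL();
            return true;
        } catch (MalformedURLException e) {
            System.out.println(spec + " failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample URLs are placeholders; nothing is opened or fetched.
        List<String> samples = List.of(
                "http://host/path",
                "https://host/path",
                "file:///tmp/demo.txt",
                "foo://host/path",
                "smb://host/share");
        for (String spec : samples) {
            System.out.println(spec + " -> " + canConstruct(spec));
        }
    }
}
```

On a stock JVM the built-in schemes should be accepted while {{foo}} and {{smb}} are rejected, which matches the behaviour of the deprecated constructor in Test2.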

Let's register a handler for the missing protocol in the next example:
{code:java}
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class Test3 {
    public static void main(String[] args) throws Exception {
        // Register a factory that the JVM consults when it looks up a protocol handler.
        URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory() {
            @Override
            public URLStreamHandler createURLStreamHandler(String string) {
                System.out.println("createURLStreamHandler - " + string);
                if ("foo".equals(string)) {
                    return new URLStreamHandler() {
                        @Override
                        protected URLConnection openConnection(URL url) throws IOException {
                            throw new UnsupportedOperationException("Not supported yet.");
                        }
                    };
                }
                // Returning null lets the JVM fall back to its default handler lookup.
                return null;
            }
        });
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
{code}
Here the code prints the success message again. This time the URL could be constructed because the URLStreamHandler for the protocol was found. We did not open the connection - but if we did, we would receive an UnsupportedOperationException. A real implementation would presumably return a usable connection instead.

This is what the protocol plugins do: they extend the JVM's capability to handle different URL schemes. That is why I can tell directly that the plugins were not loaded - otherwise the smb protocol would have been known.


> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb -mode host -outputDir crawl/dump/}}
>
> But to my surprise I do not get a dump of the URLs including the fetch status, or some statistics with counters but errors related to the unknown smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
> The nutch configuration is correct, all the other tools load plugins and log doing so to stdout. With crawlcomplete there is no such output, and the smb protocol is unknown.
> It looks like plugin configuration is completely ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)