[ 
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891423#comment-17891423
 ] 

Hiran Chaudhuri edited comment on NUTCH-3081 at 10/21/24 7:07 AM:
------------------------------------------------------------------

To make the problem clearer (and to emphasize that the warning about the 
deprecated constructor is not the culprit), I created three examples. Here is 
some Java code that creates a URL with the deprecated constructor:

 
{code:java}
import java.net.URL;

public class Test1 {
    public static void main(String[] args) throws Exception {
        new URL("http://host/path");
        System.out.println("Success!");
    }
}
 {code}
When I execute it, it just prints the success message. Let's change the URL and 
see what happens:

 

 
{code:java}
import java.net.URL;

public class Test2 {
    public static void main(String[] args) throws Exception {
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
 {code}
In this case we get

{{Exception in thread "main" java.net.MalformedURLException: unknown protocol: 
foo}}

Obviously the JVM does not know how to handle the protocol {{foo}}. The 
[documentation|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URL.html#%3Cinit%3E(java.lang.String,java.lang.String,int,java.lang.String)]
states:

_Protocol handlers for the following protocols are guaranteed to exist on the 
search path:_
 * _{{http}}_
 * _{{https}}_
 * _{{file}}_
 * _{{jar}}_

So the foo protocol, just like the smb protocol, is unknown. The next example 
registers a handler for the missing protocol:
{code:java}
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class Test3 {
    public static void main(String[] args) throws Exception {
        URL.setURLStreamHandlerFactory(new URLStreamHandlerFactory() {
            @Override
            public URLStreamHandler createURLStreamHandler(String protocol) {
                // Log every lookup so we can see which protocols the JVM asks for
                System.out.println("createURLStreamHandler - " + protocol);
                if ("foo".equals(protocol)) {
                    return new URLStreamHandler() {
                        @Override
                        protected URLConnection openConnection(URL url) throws IOException {
                            throw new UnsupportedOperationException("Not supported yet.");
                        }
                        
                    };
                }
                // null lets the JVM fall back to its built-in handlers
                return null;
            }
        });
        
        new URL("foo://host/path");
        System.out.println("Success!");
    }
}
 {code}
Here the code prints the success message again. This time the URL could be 
constructed because a URLStreamHandler for the protocol was found. We did not 
open the connection; had we done so, we would have received an 
UnsupportedOperationException. A real implementation would presumably return a 
working connection instead.
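
To illustrate what a handler that actually serves content might look like, here 
is a minimal sketch. Assumptions: the FooHandler class, the in-memory body and 
the read helper are invented for this example and have nothing to do with how 
protocol-smb is implemented. The handler is passed directly to the URL 
constructor rather than registered through URL.setURLStreamHandlerFactory, 
because the factory can only be set once per JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;

class FooHandlerSketch {
    /** Hypothetical handler: every foo URL yields a fixed in-memory body. */
    static class FooHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL url) throws IOException {
            return new URLConnection(url) {
                @Override
                public void connect() {
                    connected = true; // nothing to connect to in this sketch
                }

                @Override
                public InputStream getInputStream() {
                    // Assumption: the "content" is just derived from the URL path
                    String body = "content of " + getURL().getPath();
                    return new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8));
                }
            };
        }
    }

    /** Builds a URL with the hypothetical handler and reads its whole body. */
    @SuppressWarnings("deprecation") // the deprecation is not the point here
    static String read(String spec) {
        try {
            URL url = new URL(null, spec, new FooHandler());
            try (InputStream in = url.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read("foo://host/path"));
    }
}
```

With this handler in place, opening the stream of a foo URL returns data 
instead of throwing.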

This is what the protocol plugins do: they extend the JVM's capability to 
handle additional URL schemes. That is why I can tell directly that the plugins 
were not loaded; otherwise the smb protocol would have been known.
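
As a rough way to check this from code, the sketch below probes a few schemes 
by constructing a URL with the four-argument constructor from the documentation 
linked above and catching the MalformedURLException. The class name SchemeProbe 
and the method isKnown are made up for this illustration; they are not part of 
Nutch or the JDK.

```java
import java.net.MalformedURLException;
import java.net.URL;

class SchemeProbe {
    /** Returns true if this JVM currently finds a handler for the scheme. */
    @SuppressWarnings("deprecation") // the deprecation is not the point here
    static boolean isKnown(String scheme) {
        try {
            // Throws MalformedURLException when no handler is found for the scheme
            new URL(scheme, "host", -1, "/path");
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] schemes = {"http", "https", "file", "jar", "foo", "smb"};
        for (String scheme : schemes) {
            System.out.println(scheme + " -> " + (isKnown(scheme) ? "known" : "unknown"));
        }
    }
}
```

On a plain JVM without extra handlers I would expect the four guaranteed 
protocols to be reported as known and foo and smb as unknown. If a handler for 
smb had been registered, smb should show up as known as well.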



> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin. 
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command 
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb 
> -mode host -outputDir crawl/dump/}}
>  
> But to my surprise I do not get a dump of the URLs including the fetch 
> status, or some statistics with counters but errors related to the unknown 
> smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats 
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from 
> URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from 
> URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
> The nutch configuration is correct, all the other tools load plugins and log 
> doing so to stdout. With crawlcomplete there is no such output, and the smb 
> protocol is unknown. It looks like plugin configuration is completely ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
