[ https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891367#comment-17891367 ]

Lewis John McGibbney edited comment on NUTCH-3081 at 10/20/24 10:43 PM:
------------------------------------------------------------------------

Hi [~hiranchaudhuri] your job is failing at the [following line in CrawlCompletionStats.java#L205|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/CrawlCompletionStats.java#L205].

It looks like the [Java URL Class|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html] is the culprit here. Maybe we are using it inappropriately. I remember when we wrote the tool, we were only working with URLs fetched via HTTP(s).

I also ended up reading the [following discussion|https://github.com/hierynomus/smbj/issues/89] which recommends using [java.net.URI#create|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URI.html]. We use the URI class in [lots of existing places|https://github.com/search?q=repo%3Aapache%2Fnutch+%22.toURL%28%29%22&type=code]. We can then transform the URI into a URL using the URI#toURL() method.


was (Author: lewismc):
Hi [~hiranchaudhuri] your job is failing at the [following line in CrawlCompletionStats.java#L205|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/CrawlCompletionStats.java#L205].

It looks like the [Java URL Class|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html] is the culprit here. Maybe we are using it inappropriately. I remember when we wrote the tool, we were only working with URLs fetched via HTTP(s).

I also ended up reading the [following discussion|https://github.com/hierynomus/smbj/issues/89] which recommends using [java.net.URI#create|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URI.html]. We should look into that. We can then transform the URI into a URL using the URI#toURL() method.

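To make the URI suggestion in the comment concrete, here is a minimal sketch. It is not the actual CrawlCompletionStats code; the class name, the helper names and the nas.example host are made up for illustration. It assumes a plain JVM with no smb protocol handler registered. The point it shows: java.net.URI parses the host without needing a protocol handler, whereas URI#toURL() still needs one and throws MalformedURLException for smb, so reading the host from the URI directly may be the safer route.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical illustration class, not part of Nutch.
class SmbHostSketch {

    /** Extracts the host via java.net.URI, which needs no protocol handler. */
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Returns true if this JVM can turn the string into a java.net.URL. */
    static boolean convertibleToUrl(String url) {
        try {
            URI.create(url).toURL();
            return true;
        } catch (MalformedURLException e) {
            // Thrown when no protocol handler is found, e.g. "unknown protocol: smb"
            return false;
        }
    }

    public static void main(String[] args) {
        String smb = "smb://user@nas.example/Documents"; // assumed example URL
        System.out.println(hostOf(smb));                                  // nas.example
        System.out.println(convertibleToUrl(smb));                        // false without an smb handler
        System.out.println(convertibleToUrl("https://nutch.apache.org/")); // true
    }
}
```

If the plugin-provided smb handler were registered in the JVM running the job, the toURL() call would presumably succeed as well; the sketch only covers the case where it is not.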
> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb -mode host -outputDir crawl/dump/}}
>
> But to my surprise I do not get a dump of the URLs including the fetch
> status, or some statistics with counters but errors related to the unknown
> smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
>
> The nutch configuration is correct, all the other tools load plugins and log
> doing so to stdout. With crawlcomplete there is no such output, and the smb
> protocol is unknown. It looks like plugin configuration is completely ignored.



--
This message was sent by Atlassian Jira (v8.20.10#820010)