[
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891367#comment-17891367
]
Lewis John McGibbney edited comment on NUTCH-3081 at 10/20/24 10:43 PM:
------------------------------------------------------------------------
Hi [~hiranchaudhuri], your job is failing at the [following line in
CrawlCompletionStats.java#L205|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/CrawlCompletionStats.java#L205].
It looks like the [Java URL
Class|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html]
is the culprit here: it rejects any scheme that has no registered protocol
handler, which matches the "unknown protocol: smb" error in your log. Maybe we
are using it inappropriately. I remember that when we wrote the tool, we were
only working with URLs fetched via HTTP(S).
I also ended up reading the [following
discussion|https://github.com/hierynomus/smbj/issues/89], which recommends using
[java.net.URI#create|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URI.html].
We already use the URI class in [lots of existing
places|https://github.com/search?q=repo%3Aapache%2Fnutch+%22.toURL%28%29%22&type=code].
We can then transform the URI into a URL using the URI#toURL() method.
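As a rough illustration of the idea (a sketch only, not the actual patch for CrawlCompletionStats), java.net.URI parses the syntax of an address without looking up a protocol handler, so the host of an smb:// address can be read directly. The class name HostFromUri, its two helper methods and the address smb://user@nas.example/Documents are all made up for this example. Note that URI#toURL() still throws MalformedURLException when no handler for the scheme is registered in the JVM, and that URI#create throws IllegalArgumentException for strings with illegal characters such as unescaped spaces.

```java
import java.net.MalformedURLException;
import java.net.URI;

public class HostFromUri {

  /** Returns the host part of a URL string, or null if it has none. */
  public static String hostOf(String url) {
    // URI only parses the syntax; it does not look up a protocol handler,
    // so schemes such as smb:// are accepted.
    return URI.create(url).getHost();
  }

  /** Returns true if java.net.URL can represent the given string in this JVM. */
  public static boolean convertibleToUrl(String url) {
    try {
      URI.create(url).toURL();
      return true;
    } catch (MalformedURLException e) {
      // Thrown when no protocol handler is registered for the scheme.
      return false;
    }
  }

  public static void main(String[] args) {
    // Hypothetical address, used only for illustration.
    String url = "smb://user@nas.example/Documents";
    System.out.println("host: " + hostOf(url));
    System.out.println("convertible to java.net.URL: " + convertibleToUrl(url));
  }
}
```

If only the host or domain is needed for the statistics, reading it from the URI may be enough, and the conversion to a URL could be skipped for schemes that have no handler.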
was (Author: lewismc):
Hi [~hiranchaudhuri] your job is failing at the [following line in
CrawlCompletionStats.java#L205|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/CrawlCompletionStats.java#L205].
It looks like the [Java URL
Class|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html]
is the culprit here. Maybe we are using it inappropriately. I remember when we
wrote the tool, we were only working with URLs fetched via HTTP(s).
I also ended up reading the [following
discussion|https://github.com/hierynomus/smbj/issues/89] which recommends using
[java.net.URI#create|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URI.html].
We should look into that.
We can then transform the URI into a URL using the URI#toURL() method.
> Crawlcomplete command does not load plugins
> -------------------------------------------
>
> Key: NUTCH-3081
> URL: https://issues.apache.org/jira/browse/NUTCH-3081
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode,
> sharing)
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scan has progressed, I tried the crawlcomplete command
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb
> -mode host -outputDir crawl/dump/}}
>
> But to my surprise, instead of a dump of the URLs with their fetch status
> or some statistics with counters, I get errors about the unknown smb
> protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents/.htaccess: unknown protocol: smb}}
> The Nutch configuration is correct: all the other tools load plugins and log
> that they do so to stdout. With crawlcomplete there is no such output, and the
> smb protocol is unknown. It looks like the plugin configuration is completely
> ignored.