[ 
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891367#comment-17891367
 ] 

Lewis John McGibbney edited comment on NUTCH-3081 at 10/20/24 10:43 PM:
------------------------------------------------------------------------

Hi [~hiranchaudhuri], your job is failing at the [following line in 
CrawlCompletionStats.java#L205|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/CrawlCompletionStats.java#L205].
 It looks like the [Java URL 
class|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URL.html]
 is the culprit here, since it rejects the smb protocol as unknown. Maybe we are 
using it inappropriately. When we wrote the tool, we were only working with 
URLs fetched via HTTP(S). 

I also ended up reading the [following 
discussion|https://github.com/hierynomus/smbj/issues/89], which recommends using 
[java.net.URI#create|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/URI.html].
 We already use the URI class in [lots of existing 
places|https://github.com/search?q=repo%3Aapache%2Fnutch+%22.toURL%28%29%22&type=code].

Where a URL object is needed, we can then transform the URI into a URL using the 
URI#toURL() method.
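
To illustrate the idea, here is a minimal sketch, not the actual CrawlCompletionStats code. The class name {{HostFromUri}}, its two helper methods, and the {{nas.example}} URL are hypothetical and exist only for this example. It shows that java.net.URI can extract the host from an smb URL without any protocol handler being registered. One caveat as far as I can tell: URI#toURL() still performs the same handler lookup as the URL constructor, so on a plain JVM it would also fail for smb unless a handler for that scheme is installed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

public class HostFromUri {

  /** Returns the host part of the given URL string, or null if it cannot be parsed. */
  public static String hostOf(String urlString) {
    try {
      // URI only parses the syntax; it does not look up a protocol handler
      return new URI(urlString).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  /** Returns true if the string can be converted to a java.net.URL on this JVM. */
  public static boolean hasUrlHandler(String urlString) {
    try {
      // toURL() needs a stream handler registered for the scheme
      URI.create(urlString).toURL();
      return true;
    } catch (MalformedURLException | IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Hypothetical URL, chosen only for illustration
    String smb = "smb://user@nas.example/Documents/.htaccess";
    System.out.println(hostOf(smb));
    System.out.println(hasUrlHandler(smb));
    System.out.println(hasUrlHandler("https://nutch.apache.org/"));
  }
}
```

If the tool only needs the host or domain for its statistics, staying with the URI and never converting to a URL might be enough. Note that URI#getHost() returns null for authorities it cannot parse as a server-based host name, so callers would need to handle that case.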


> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin. 
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command 
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb 
> -mode host -outputDir crawl/dump/}}
>  
> But to my surprise I do not get a dump of the URLs including the fetch 
> status, or some statistics with counters. Instead I get errors related to the 
> unknown smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats 
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from 
> URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from 
> URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
> The Nutch configuration is correct: all the other tools load plugins and log 
> doing so to stdout. With crawlcomplete there is no such output, and the smb 
> protocol is unknown. It looks like plugin configuration is completely ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
