[ 
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiran Chaudhuri updated NUTCH-3081:
-----------------------------------
    Affects Version/s: 1.21
          Environment: 
Ubuntu 22 LTS

openjdk version "21.0.4" 2024-07-16
OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
sharing)

> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I am running Nutch to scan my Synology NAS via the protocol-smb plugin. 
> The scan is running nicely in the background and the content in Solr is 
> growing. To check how far the scan has progressed, I tried the 
> crawlcomplete command like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb 
> -mode host -outputDir crawl/dump/}}
>  
> To my surprise, I get neither a dump of the URLs with their fetch status 
> nor statistics with counters, but errors about an unknown smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats 
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0] Failed to get host or domain from URL 
> smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats 
> [LocalJobRunner Map Task Executor #0] Failed to get host or domain from URL 
> smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
> The Nutch configuration is correct: all the other tools load plugins and 
> log that they do so to stdout. With crawlcomplete there is no such output, 
> and the smb protocol is unknown. It looks like the plugin configuration is 
> completely ignored.
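
The "unknown protocol: smb" text in the log matches the message that java.net.URL produces when no stream handler is registered for a scheme. The sketch below reproduces that message in plain Java, without Nutch on the classpath. It assumes (this is not confirmed by the report) that crawlcomplete parses the URL with java.net.URL before any plugin-provided handler for smb has been registered. The class name UnknownProtocolDemo and the method hostOrError are hypothetical and exist only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; not part of Nutch.
public class UnknownProtocolDemo {

    // Returns the host of the URL, or the parser's error message if
    // java.net.URL has no handler for the URL's scheme.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated since Java 20
    static String hostOrError(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // http has a built-in handler, so the host can be extracted.
        System.out.println(hostOrError("http://nas.fritz.box/Documents"));
        // smb has no built-in handler; without a registered handler this fails.
        System.out.println(hostOrError("smb://nas.fritz.box/Documents"));
    }
}
```

On a stock JDK the second call is expected to report "error: unknown protocol: smb", which is consistent with the plugin that supplies the smb handler not having been loaded when the URL was parsed.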



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
