[
https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890538#comment-17890538
]
Hiran Chaudhuri commented on NUTCH-3081:
----------------------------------------
Since automated testing is mentioned in the other ticket:
How about testing with the protocol-foo plugin? That's what it was designed
for...
> Crawlcomplete command does not load plugins
> -------------------------------------------
>
> Key: NUTCH-3081
> URL: https://issues.apache.org/jira/browse/NUTCH-3081
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode,
> sharing)
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command
> like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb
> -mode host -outputDir crawl/dump/}}
>
> But to my surprise I get neither a dump of the URLs including the fetch
> status nor some statistics with counters, but errors related to the unknown
> smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats
> [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats
> [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from
> URL smb://[email protected]/Documents/.htaccess: unknown protocol: smb}}
> The Nutch configuration is correct: all the other tools load plugins and log
> doing so to stdout. With crawlcomplete there is no such output, and the smb
> protocol is unknown. It looks like the plugin configuration is completely
> ignored.
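
For illustration only: the "unknown protocol: smb" text in the log above matches
the message java.net.URL produces when no stream handler is registered for a
scheme. The sketch below reproduces that message outside of Nutch. It assumes no
smb handler is on the classpath; the host name nas.example and the class and
method names are hypothetical, and whether CrawlCompletionStats takes exactly
this code path is an assumption, not something the report confirms.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UnknownProtocolDemo {

    // Parse with java.net.URL, which needs a registered handler per scheme.
    // (The URL constructor is deprecated since Java 20 but still available.)
    static String hostViaUrl(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            return "error: " + e.getMessage();
        }
    }

    // Parse with java.net.URI, which is purely syntactic and needs no handler.
    static String hostViaUri(String spec) {
        try {
            return new URI(spec).getHost();
        } catch (URISyntaxException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String spec = "smb://nas.example/Documents"; // hypothetical URL
        System.out.println("URL: " + hostViaUrl(spec));
        System.out.println("URI: " + hostViaUri(spec));
    }
}
```

Without a handler for smb, the URL variant reports the error while the URI
variant still returns the host. That would be consistent with the protocol
plugin, which would normally supply such a handler, not being loaded by the
command.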
--
This message was sent by Atlassian Jira
(v8.20.10#820010)