[ https://issues.apache.org/jira/browse/NUTCH-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hiran Chaudhuri updated NUTCH-3081:
-----------------------------------
    Description:
So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
The scan is running nicely in the background and content in Solr is growing.
To check how far the scanning progressed I try out the crawlcomplete command like so:
{{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb -mode host -outputDir crawl/dump/}}

But to my surprise I do not get a dump of the URLs including the fetch status, or some statistics with counters but errors related to the unknown smb protocol:
{{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats [main] CrawlCompletionStats: starting}}
{{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
{{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
The nutch configuration is correct, all the other tools load plugins and log doing so to stdout. With crawlcomplete there is no such output, and the smb protocol is unknown. It looks like plugin configuration is completely ignored.

  was:
So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
The scan is running nicely in the background and content in Solr is growing.
To check how far the scanning progressed I try out the crawlcomplete command like so:
{{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb -mode host -outputDir crawl/dump/}}

But to my surprise I do not get a dump of the URLs including the fetch status, or some statistics with counters but errors related to the unknown smb protocol:
{{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats [main] CrawlCompletionStats: starting}}
{{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
{{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
The nutch configuration is correct, all the other tools load plugins and log doing so to stdout. With crawlcomplete there is no such output, and the smb protocol is unknown. It looks like pluginn configuration is completely ignored.


> Crawlcomplete command does not load plugins
> -------------------------------------------
>
>                 Key: NUTCH-3081
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> To check how far the scanning progressed I try out the crawlcomplete command like so:
> {{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb -mode host -outputDir crawl/dump/}}
>
> But to my surprise I do not get a dump of the URLs including the fetch status, or some statistics with counters but errors related to the unknown smb protocol:
> {{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats [main] CrawlCompletionStats: starting}}
> {{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
> {{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats [LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}
> The nutch configuration is correct, all the other tools load plugins and log doing so to stdout. With crawlcomplete there is no such output, and the smb protocol is unknown. It looks like plugin configuration is completely ignored.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
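
For context on the error text in the report: "unknown protocol: smb" is the message java.net.URL raises when no stream handler is registered for a scheme. The sketch below reproduces that message on a plain JVM. It assumes that crawlcomplete derives host and domain through java.net.URL and that the smb handler would normally be supplied by the plugin system; both are inferred from the log output and not confirmed here. The class name UnknownProtocolDemo and the host nas.example are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolDemo {

    // Try to parse a URL string and report either its host or the parse error.
    // Note: the URL(String) constructor is deprecated since Java 20 but still works.
    static String tryParse(String spec) {
        try {
            return "host=" + new URL(spec).getHost();
        } catch (MalformedURLException e) {
            // With no handler registered for the scheme, the JDK reports
            // "unknown protocol: <scheme>".
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // http has a built-in handler, so parsing succeeds.
        System.out.println(tryParse("http://nas.example/Documents"));
        // smb has no built-in handler; on a plain JVM this fails.
        System.out.println(tryParse("smb://nas.example/Documents"));
    }
}
```

If a handler for smb were registered before parsing (for example via URL.setURLStreamHandlerFactory), the second call would succeed, which is consistent with the observation that tools which log plugin loading do not show this error.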