Hiran Chaudhuri created NUTCH-3088:
--------------------------------------
Summary: Parsechecker command does not load plugins
Key: NUTCH-3088
URL: https://issues.apache.org/jira/browse/NUTCH-3088
Project: Nutch
Issue Type: Bug
Affects Versions: 1.21
Environment: Ubuntu 22 LTS
openjdk version "21.0.4" 2024-07-16
OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode,
sharing)
Reporter: Hiran Chaudhuri
I am running Nutch to crawl my Synology NAS via the protocol-smb plugin. The
crawl runs fine in the background, and the content indexed in Solr is growing.
To check how far the crawl has progressed, I ran the crawlcomplete command
like so:
{{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb
-mode host -outputDir crawl/dump/}}
To my surprise, I get neither a dump of the URLs with their fetch status nor
statistics with counters. Instead, the log shows errors about an unknown smb
protocol:
{{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats
[main] CrawlCompletionStats: starting}}
{{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats
[LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL
smb://[email protected]/Documents: unknown protocol: smb}}
{{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats
[LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL
smb://[email protected]/Documents/.htaccess: unknown protocol: smb}}
The Nutch configuration is correct: all the other tools load plugins and log
that they do so to stdout. With crawlcomplete there is no such output, and the
smb protocol is unknown. It looks like the plugin configuration is completely
ignored.
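
For reference, here is a minimal sketch of where the "unknown protocol: smb"
message can come from. It assumes (not verified against the Nutch sources) that
the tool parses URLs through java.net.URL, which rejects any scheme without a
registered URLStreamHandler. The class name, the method names, the no-op handler
and the host nas.example are all hypothetical and only serve the illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class UnknownProtocolDemo {

    // Hypothetical stand-in for a plugin-provided handler: it lets
    // java.net.URL parse smb:// URLs but cannot actually open them.
    static final URLStreamHandler PARSE_ONLY_HANDLER = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) {
            throw new UnsupportedOperationException("parsing only");
        }
    };

    // No handler is known for "smb", so the constructor throws
    // MalformedURLException before the host can be extracted.
    @SuppressWarnings("deprecation") // URL(String) is deprecated since Java 20
    public static String hostWithoutHandler(String url) throws MalformedURLException {
        return new URL(url).getHost();
    }

    // With an explicit handler the same string parses and the host is available.
    @SuppressWarnings("deprecation")
    public static String hostWithHandler(String url) throws MalformedURLException {
        return new URL(null, url, PARSE_ONLY_HANDLER).getHost();
    }

    public static void main(String[] args) throws Exception {
        String url = "smb://nas.example/Documents";
        try {
            hostWithoutHandler(url);
        } catch (MalformedURLException e) {
            // prints "without handler: unknown protocol: smb"
            System.out.println("without handler: " + e.getMessage());
        }
        // prints "with handler: nas.example"
        System.out.println("with handler: " + hostWithHandler(url));
    }
}
```

If that assumption holds, the errors above would be consistent with
crawlcomplete never registering the handler that the plugin system normally
provides for the smb scheme.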