Hiran Chaudhuri created NUTCH-3081:
--------------------------------------

             Summary: Crawlcomplete command does not load plugins
                 Key: NUTCH-3081
                 URL: https://issues.apache.org/jira/browse/NUTCH-3081
             Project: Nutch
          Issue Type: Bug
            Reporter: Hiran Chaudhuri


I am running Nutch to scan my Synology NAS via the protocol-smb plugin. The 
scan is running nicely in the background, and the content in Solr is growing.

To check how far the scan has progressed, I tried the crawlcomplete command 
like so:

{{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb 
-mode host -outputDir crawl/dump/}}

To my surprise, I get neither a dump of the URLs with their fetch status nor 
statistics with counters, only errors about the unknown smb protocol:

{{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats 
[main] CrawlCompletionStats: starting}}
{{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats 
[LocalJobRunner Map Task Executor #0] Failed to get host or domain from URL 
smb://hi...@nas.fritz.box/Documents: unknown protocol: smb}}
{{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats 
[LocalJobRunner Map Task Executor #0] Failed to get host or domain from URL 
smb://hi...@nas.fritz.box/Documents/.htaccess: unknown protocol: smb}}

The Nutch configuration is correct: all the other tools load plugins and log 
that they do so to stdout. With crawlcomplete there is no such output, and the 
smb protocol is unknown. It looks like the plugin configuration is completely 
ignored.
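
For reference, the "unknown protocol: smb" text matches the message a stock JVM 
produces when java.net.URL is asked to parse a scheme that has no registered 
stream handler. The following is only a minimal sketch of that JVM behaviour, 
not Nutch code: the class name, the helper method and the host nas.example are 
made up for illustration, and whether crawlcomplete hits exactly this code path 
is an assumption on my part.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
public class UnknownProtocolDemo {

    /**
     * Returns the host part of the URL, or the exception message if the
     * JVM has no URLStreamHandler registered for the URL's scheme.
     */
    public static String hostOrError(String spec) {
        try {
            // new URL(String) is deprecated in recent JDKs but still works.
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // http has a built-in handler, so the host is returned.
        System.out.println(hostOrError("http://nas.example/Documents"));
        // smb has no built-in handler, so parsing fails.
        System.out.println(hostOrError("smb://nas.example/Documents"));
    }
}
```

On a JVM without any additional handler registered, the second call returns 
"unknown protocol: smb". That would be consistent with the tool never 
initializing the plugin system before parsing URLs, but I have not verified 
this in the crawlcomplete source.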



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
