[ 
https://issues.apache.org/jira/browse/NUTCH-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiran Chaudhuri updated NUTCH-3088:
-----------------------------------
    Description: 
So I am running Nutch to scan my Synology NAS via the protocol-smb plugin. The 
scan is running nicely in the background and content in Solr is growing.
But the parsing phase throws a lot of exceptions. To check what may be wrong 
with those URLs I run the parsechecker like so:

{{./nutch/runtime/local/bin/nutch parsechecker 
"smb//[email protected]/Documents/Hiran/MyDocument.pdf"}}

 

But to my surprise I do not get the same parse exception but an error related 
to the unknown smb protocol:

{{2024-11-03 16:07:23,269 INFO org.apache.nutch.plugin.PluginManifestParser 
[main] Plugins: looking in: 
/home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] 
Plugin Auto-activation mode: [true]}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] 
Registered Plugins:}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Regex URL Filter (urlfilter-regex)}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Html Parse Plug-in (parse-html)}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main]  
   the nutch core extension points (nutch-extensionpoints)}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Basic Indexing Filter (index-basic)}}
{{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Anchor Indexing Filter (index-anchor)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Tika Parser Plug-in (parse-tika)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Index Static (index-static)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Top Level Domain Plugin (tld)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Regex URL Filter Framework (lib-regex-filter)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Language Identification Parser/Filter (language-identifier)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Regex URL Normalizer (urlnormalizer-regex)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   CyberNeko HTML Parser (lib-nekohtml)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Subcollection indexing and query filter (subcollection)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   URL Meta Indexing Filter (urlmeta)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   OPIC Scoring Plug-in (scoring-opic)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Pass-through URL Normalizer (urlnormalizer-pass)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   SMB Protocol based on https://github.com/hierynomus/smbj (protocol-smb)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   More Indexing Filter (index-more)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   SolrIndexWriter (indexer-solr)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Creative Commons Plugins (creativecommons)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
   Replace Indexer (index-replace)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] 
Registered Extension-Points:}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Content Parser)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch URL Filter)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (HTML Parse Filter)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Scoring)}}
{{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch URL Normalizer)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Publisher)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Exchange)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Protocol)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch URL Ignore Exemption Filter)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Index Writer)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Segment Merge Filter)}}
{{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main]  
    (Nutch Indexing Filter)}}
{{2024-11-03 16:07:23,344 INFO org.apache.nutch.parse.ParserChecker [main] 
fetching: smb//[email protected]/Documents/Hiran/MyDocument.pdf}}
{{Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
java.net.MalformedURLException: no protocol: 
smb//[email protected]/Documents/Hiran/MyDocument.pdf}}
{{    at 
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)}}
{{    at 
org.apache.nutch.util.AbstractChecker.getProtocolOutput(AbstractChecker.java:196)}}
{{    at org.apache.nutch.parse.ParserChecker.process(ParserChecker.java:186)}}
{{    at 
org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)}}
{{    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:150)}}
{{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
{{    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:308)}}

The Nutch configuration is correct: all the other tools load plugins and log 
doing so to stdout. With parsechecker the output shows that the smb plugin gets 
loaded, and still the smb protocol is unknown. What is happening here?
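
The two error messages in this report differ: the parsechecker run reports 
"no protocol", while the earlier crawlcomplete run reported "unknown protocol: 
smb". The sketch below may help tell the two apart. It assumes both messages 
come from java.net.URL parsing (the stack trace only shows ProtocolFactory 
wrapping a MalformedURLException), and it runs in a plain JVM without any Nutch 
plugins. The class name UrlSchemeCheck and the host nas.example are made up for 
illustration and are not part of Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlSchemeCheck {

    /**
     * Tries to parse the given string with java.net.URL and returns "ok",
     * or the message of the MalformedURLException if parsing fails.
     */
    static String describe(String spec) {
        try {
            // The String constructor is deprecated since Java 20 but still available.
            new URL(spec);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical host name; note the missing ':' after the scheme.
        System.out.println(describe("smb//nas.example/Documents/MyDocument.pdf"));
        // Same URL with "smb://"; a plain JVM has no handler registered for smb.
        System.out.println(describe("smb://nas.example/Documents/MyDocument.pdf"));
        // A scheme the JVM supports out of the box, for comparison.
        System.out.println(describe("http://nas.example/Documents/MyDocument.pdf"));
    }
}
```

In a plain JVM the first call should yield a message starting with "no 
protocol", which means no scheme could be parsed from the string at all, and 
the second one "unknown protocol: smb", which means the scheme was parsed but 
no handler is registered for it. If that holds, the parsechecker failure above 
may be related to the URL string as typed rather than to plugin loading.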

  was:
So I am running Nutch to scan my Synology NAS via the protocol-smb plugin. The 
scan is running nicely in the background and content in Solr is growing.
To check how far the scanning progressed I try out the crawlcomplete command 
like so:

{{./nutch/runtime/local/bin/nutch crawlcomplete -inputDirs ./crawl/crawldb 
-mode host -outputDir crawl/dump/}}

 

But to my surprise I do not get a dump of the URLs including the fetch status, 
or some statistics with counters but errors related to the unknown smb protocol:

{{2024-10-16 23:02:40,425 INFO org.apache.nutch.util.CrawlCompletionStats 
[main] CrawlCompletionStats: starting}}
{{2024-10-16 23:02:40,990 ERROR org.apache.nutch.util.CrawlCompletionStats 
[LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL 
smb://[email protected]/Documents: unknown protocol: smb}}
{{2024-10-16 23:02:40,991 ERROR org.apache.nutch.util.CrawlCompletionStats 
[LocalJobRunner Map Task Executor #0|#0] Failed to get host or domain from URL 
smb://[email protected]/Documents/.htaccess: unknown protocol: smb}}

The Nutch configuration is correct: all the other tools load plugins and log 
doing so to stdout. With crawlcomplete there is no such output, and the smb 
protocol is unknown. It looks like plugin configuration is completely ignored.


> Parsechecker command does not load plugins
> ------------------------------------------
>
>                 Key: NUTCH-3088
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3088
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
