[ 
https://issues.apache.org/jira/browse/NUTCH-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895125#comment-17895125
 ] 

Hiran Chaudhuri commented on NUTCH-3088:
----------------------------------------

My bad: the URL was missing the colon after the scheme (smb// instead of smb://). Please ignore.
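
For anyone who lands here with the same "no protocol" message, here is a minimal sketch of how a missing colon produces it. It uses only the plain java.net classes, not Nutch's ProtocolFactory, and the class name SchemeCheck, the helper methods and the host nas.example are made up for illustration. Without a colon the string has no scheme, so java.net.URL rejects it before any protocol plugin could be consulted.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class, not part of Nutch.
public class SchemeCheck {

    // True if the string parses as a URI that carries a scheme such as "smb".
    public static boolean hasScheme(String s) {
        try {
            return new URI(s).getScheme() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Returns the MalformedURLException message, or null if java.net.URL accepts the string.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated since Java 20
    public static String urlError(String s) {
        try {
            new URL(s);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical host and path, used only for illustration.
        String broken = "smb//nas.example/Documents/doc.pdf"; // colon missing
        String fixed = "smb://nas.example/Documents/doc.pdf";

        System.out.println(hasScheme(broken)); // false
        System.out.println(hasScheme(fixed));  // true
        System.out.println(urlError(broken));  // no protocol: smb//nas.example/Documents/doc.pdf
    }
}
```

The sketch only checks the scheme of the corrected string. Actually constructing a java.net.URL from it also requires a stream handler for smb to be registered in the JVM, which is outside the scope of this example.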

> Parsechecker command does not use protocol plugins
> --------------------------------------------------
>
>                 Key: NUTCH-3088
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3088
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin. 
> The scan is running nicely in the background and content in Solr is growing.
> But the parsing phase throws a lot of exceptions. To check what may be wrong 
> with those URLs I run the parsechecker like so:
> {{./nutch/runtime/local/bin/nutch parsechecker 
> "smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf"}}
>  
> But to my surprise I do not get the same parse exception but an error 
> related to the unknown smb protocol:
> {{2024-11-03 16:07:23,269 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Regex URL Filter (urlfilter-regex)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Html Parse Plug-in (parse-html)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Basic Indexing Filter (index-basic)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Anchor Indexing Filter (index-anchor)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Tika Parser Plug-in (parse-tika)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Index Static (index-static)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Regex URL Filter Framework (lib-regex-filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Language Identification Parser/Filter (language-identifier)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Regex URL Normalizer (urlnormalizer-regex)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     CyberNeko HTML Parser (lib-nekohtml)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Subcollection indexing and query filter (subcollection)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     URL Meta Indexing Filter (urlmeta)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     OPIC Scoring Plug-in (scoring-opic)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Pass-through URL Normalizer (urlnormalizer-pass)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     SMB Protocol based on https://github.com/hierynomus/smbj 
> (protocol-smb)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     More Indexing Filter (index-more)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     SolrIndexWriter (indexer-solr)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Creative Commons Plugins (creativecommons)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Replace Indexer (index-replace)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-11-03 16:07:23,344 INFO org.apache.nutch.parse.ParserChecker [main] 
> fetching: smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf}}
> {{Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: 
> java.net.MalformedURLException: no protocol: 
> smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf}}
> {{    at 
> org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)}}
> {{    at 
> org.apache.nutch.util.AbstractChecker.getProtocolOutput(AbstractChecker.java:196)}}
> {{    at 
> org.apache.nutch.parse.ParserChecker.process(ParserChecker.java:186)}}
> {{    at 
> org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)}}
> {{    at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:150)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:308)}}
> The Nutch configuration is correct: all the other tools load plugins and log 
> doing so to stdout. With parsechecker the output shows that the smb plugin 
> gets loaded, and still the smb protocol is unknown. What is happening here?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
