[ https://issues.apache.org/jira/browse/NUTCH-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895125#comment-17895125 ]
Hiran Chaudhuri commented on NUTCH-3088:
----------------------------------------

My bad: The URL was missing a colon. Please ignore.

> Parsechecker command does not use protocol plugins
> --------------------------------------------------
>
> Key: NUTCH-3088
> URL: https://issues.apache.org/jira/browse/NUTCH-3088
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> openjdk version "21.0.4" 2024-07-16
> OpenJDK Runtime Environment (build 21.0.4+7-Ubuntu-1ubuntu222.04)
> OpenJDK 64-Bit Server VM (build 21.0.4+7-Ubuntu-1ubuntu222.04, mixed mode, sharing)
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> So I am running Nutch to scan my Synology NAS via the protocol-smb plugin.
> The scan is running nicely in the background and content in Solr is growing.
> But the parsing phase throws a lot of exceptions. To check what may be wrong
> with those URLs I run the parsechecker like so:
> {{./nutch/runtime/local/bin/nutch parsechecker "smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf"}}
>
> But to my surprise I do not get the same parse exception but an error
> related to the unknown smb protocol:
> {{2024-11-03 16:07:23,269 INFO org.apache.nutch.plugin.PluginManifestParser [main] Plugins: looking in: /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Plugin Auto-activation mode: [true]}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Plugins:}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Filter (urlfilter-regex)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Html Parse Plug-in (parse-html)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] the nutch core extension points (nutch-extensionpoints)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Basic Indexing Filter (index-basic)}}
> {{2024-11-03 16:07:23,339 INFO org.apache.nutch.plugin.PluginRepository [main] Anchor Indexing Filter (index-anchor)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Tika Parser Plug-in (parse-tika)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Index Static (index-static)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Top Level Domain Plugin (tld)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Filter Framework (lib-regex-filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Language Identification Parser/Filter (language-identifier)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Subcollection indexing and query filter (subcollection)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] URL Meta Indexing Filter (urlmeta)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] SMB Protocol based on https://github.com/hierynomus/smbj (protocol-smb)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] More Indexing Filter (index-more)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] SolrIndexWriter (indexer-solr)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Creative Commons Plugins (creativecommons)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Replace Indexer (index-replace)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Extension-Points:}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Content Parser)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] (HTML Parse Filter)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Scoring)}}
> {{2024-11-03 16:07:23,340 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Normalizer)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Publisher)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Exchange)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Protocol)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Ignore Exemption Filter)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Index Writer)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Segment Merge Filter)}}
> {{2024-11-03 16:07:23,341 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Indexing Filter)}}
> {{2024-11-03 16:07:23,344 INFO org.apache.nutch.parse.ParserChecker [main] fetching: smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf}}
> {{Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: no protocol: smb//hi...@nas.fritz.box/Documents/Hiran/MyDocument.pdf}}
> {{ at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)}}
> {{ at org.apache.nutch.util.AbstractChecker.getProtocolOutput(AbstractChecker.java:196)}}
> {{ at org.apache.nutch.parse.ParserChecker.process(ParserChecker.java:186)}}
> {{ at org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:86)}}
> {{ at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:150)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:308)}}
> The Nutch configuration is correct: all the other tools load plugins and log
> doing so to stdout. With parsechecker there is output that the smb plugin
> gets loaded, and still the smb protocol is unknown. What is happening here?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
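
For anyone hitting the same "no protocol" message: the trace above ends in a java.net.MalformedURLException, which is what java.net.URL reports when a string has no scheme before the first colon. The following is a minimal sketch of that behaviour using only the JDK, not Nutch's own code. The class name SchemeCheck, both helper methods, and the host nas.example are hypothetical and used for illustration only.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class SchemeCheck {

    // Hypothetical helper: returns the scheme of the string, or null if it has none
    // (or if the string is not a syntactically valid URI at all).
    static String schemeOf(String s) {
        try {
            return new URI(s).getScheme();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Hypothetical helper: returns the MalformedURLException message that
    // java.net.URL produces for the string, or null if it was accepted.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated since Java 20
    static String urlError(String s) {
        try {
            new URL(s);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder URLs; nas.example is not a real host.
        String missingColon = "smb//user@nas.example/Documents/file.pdf";
        String withColon = "smb://user@nas.example/Documents/file.pdf";

        // Without the colon the string is just a relative path, so no scheme is found.
        System.out.println("scheme (missing colon): " + schemeOf(missingColon));
        System.out.println("scheme (with colon):    " + schemeOf(withColon));

        // java.net.URL rejects the scheme-less string before any protocol lookup happens.
        System.out.println("URL error (missing colon): " + urlError(missingColon));
    }
}
```

Checking the scheme this way before calling parsechecker separates a malformed URL from a plugin that is actually missing.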