[ https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889215#comment-17889215 ]
Markus Jelsma commented on NUTCH-3075:
--------------------------------------

Please check the logs. The error Nutch gives you back on the command line does not contain information about the reason for the failure. That is usually found only in the actual mapper or reducer logs on the worker nodes or, when running Nutch locally, in the one big log file.

> tld plugin makes injector crash
> -------------------------------
>
>                 Key: NUTCH-3075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3075
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.21
>         Environment:
> * Ubuntu 22 LTS
> * openjdk version "21.0.4" 2024-07-16 LTS
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I cloned the current master branch (commit id
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap
> implementation. Crawling works to some degree - I am heavily experimenting
> with IMAP and the data I receive in Solr. Looking at the
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
> I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property,
> to include them all. As soon as `tld` is included, the injector dies when I
> seed my CrawlDb:
>
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser [main] Plugins: looking in: /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] Injecting seed URL file file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
> {{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] Injector: java.lang.RuntimeException: Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{ at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
> {{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
>
> The behaviour can be cured by simply removing `tld` from the property.
>
> * Could there be some better error message?
> * Why does the tld plugin crash the injector phase at all?

--
This message was sent by Atlassian Jira (v8.20.10#820010)
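
For a local run, the advice in the comment (look in the one big log file rather than at the command-line error) could be followed with a small script along these lines. This is only a sketch, not part of Nutch: the function names `find_failures` and `print_failures` are made up for illustration, the timestamp pattern is inferred from the log excerpt quoted in the issue, and the log path you pass in (for example `logs/hadoop.log` under the runtime directory) is an assumption that depends on your logging configuration.

```python
import re

# Assumption: each log record starts with a timestamp followed by the level,
# as in the excerpt quoted in the issue ("2024-10-11 23:27:52,778 ERROR ...").
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) ")


def find_failures(lines, levels=("WARN", "ERROR")):
    """Return log records at the given levels, each with its continuation lines.

    Lines that do not start with a timestamp (stack frames, "Caused by: ...")
    are attached to the record that precedes them.
    """
    records = []
    current = None  # record being collected, or None while skipping
    for line in lines:
        line = line.rstrip("\n")
        match = RECORD_START.match(line)
        if match:
            # A new record begins: store the previous one if we were keeping it.
            if current is not None:
                records.append(current)
            current = [line] if match.group(1) in levels else None
        elif current is not None and line.strip():
            current.append(line)
    if current is not None:
        records.append(current)
    return records


def print_failures(path):
    """Print every WARN/ERROR record found in the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for record in find_failures(handle):
            print("\n".join(record))
            print("-" * 60)
```

Calling `print_failures("logs/hadoop.log")` right after a failed inject should list the warnings and errors together with their stack traces; the record logged just before the "Injector job did not succeed" message is a likely place to find the underlying exception.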