[ https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891812#comment-17891812 ]
Sebastian Nagel commented on NUTCH-3075:
----------------------------------------

Hi [~hiranchaudhuri],

would you mind opening a new issue for improvements to the logging when jobs fail? I think it's better to keep this issue focused on the core bug, the broken plugin.xml of the "tld" plugin.

> tld plugin makes injector crash
> -------------------------------
>
>                 Key: NUTCH-3075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3075
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.21
>         Environment: * Ubuntu 22 LTS
> * openjdk version "21.0.4" 2024-07-16 LTS
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>             Fix For: 1.21
>
>
> I cloned the current master branch (commit id d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap implementation. Crawling works to some degree - I am heavily experimenting with IMAP and the data I receive in Solr. Looking at the [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure] I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property, to include them all. As soon as `tld` is included, the injector dies upon seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser [main] Plugins: looking in: /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository [main] (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] Injecting seed URL file file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
> {{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] Injector: java.lang.RuntimeException: Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{ at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
> {{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
>
> The behaviour can be cured by simply removing `tld` from the property.
>
> * Could there be some better error message?
> * Why does the tld plugin crash the injector phase at all?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)