[ https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891193#comment-17891193 ]
Hiran Chaudhuri commented on NUTCH-3075:
----------------------------------------
I created [https://github.com/apache/nutch/pull/830], which would improve the
uninformative error message, at least in local mode.
> tld plugin makes injector crash
> -------------------------------
>
> Key: NUTCH-3075
> URL: https://issues.apache.org/jira/browse/NUTCH-3075
> Project: Nutch
> Issue Type: Bug
> Components: injector
> Affects Versions: 1.21
> Environment: * Ubuntu 22 LTS
> * openjdk version "21.0.4" 2024-07-16 LTS
> Reporter: Hiran Chaudhuri
> Priority: Major
>
> I cloned the current master branch (commit id
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles cleanly to
> apache-nutch-1.21-SNAPSHOT.job,
> even after I added my own protocol-imap implementation. Crawling works to
> some degree; I am experimenting heavily with IMAP and the data I receive in
> Solr. Looking at the
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
> I hoped to get better information by adding all the mentioned plugins.
> So I reconfigured nutch-site.xml, in particular the `plugin.includes` property,
> to include them all. As soon as `tld` is included, the injector dies when I
> seed my CrawlDb, logging:
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser
> [main] Plugins: looking in:
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository
> [main] (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main]
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main]
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main]
> Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main]
> Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main]
> Injecting seed URL file
> file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
> {{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main]
> Injector job did not succeed, job id: job_local1500911141_0001, job status:
> FAILED, reason: NA}}
> {{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main]
> Injector: java.lang.RuntimeException: Injector job did not succeed, job id:
> job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{ at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
> {{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
>
> Removing `tld` from the `plugin.includes` property makes the problem go away.
>
> * Could the error message be more informative?
> * Why does the tld plugin crash the injector phase at all?