Hiran Chaudhuri created NUTCH-3075:
--------------------------------------
Summary: tld plugin makes injector crash
Key: NUTCH-3075
URL: https://issues.apache.org/jira/browse/NUTCH-3075
Project: Nutch
Issue Type: Bug
Components: injector
Affects Versions: 1.21
Environment: * Ubuntu 22 LTS
* openjdk version "21.0.4" 2024-07-16 LTS
Reporter: Hiran Chaudhuri
I cloned the current master branch (commit id
d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles cleanly to
apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap
implementation. Crawling works to some degree; I am experimenting heavily with
IMAP and the data I receive in Solr.
Looking at the
[IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
page, I hoped to get richer index data by adding all the plugins mentioned there.
I therefore changed the `plugin.includes` property in nutch-site.xml to include
them all. As soon as `tld` is in the list, the injector dies while seeding my
CrawlDb:
{{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] Injecting seed URL file file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
{{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
{{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] Injector: java.lang.RuntimeException: Injector job did not succeed, job id: job_local1500911141_0001, job status: FAILED, reason: NA}}
{{ at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
{{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
{{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
{{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
Removing `tld` from the property makes the failure go away.
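For illustration, the relevant part of nutch-site.xml looks roughly like the sketch below. The plugin list is an assumption made for this example, not the exact configuration used here (protocol-imap stands for the custom plugin mentioned above); the only point is whether `tld` appears in the value.

```xml
<!-- Illustrative sketch only: the plugin list is an assumption,
     not the exact value used in this report. -->
<property>
  <name>plugin.includes</name>
  <!-- With the trailing "|tld" the injector fails as described;
       without it the injector runs. -->
  <value>protocol-imap|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|tld</value>
</property>
```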
* Could the injector report a more informative error message than "reason: NA"?
* Why does the tld plugin crash the injector phase at all?