Hiran Chaudhuri created NUTCH-3075:
--------------------------------------

             Summary: tld plugin makes injector crash
                 Key: NUTCH-3075
                 URL: https://issues.apache.org/jira/browse/NUTCH-3075
             Project: Nutch
          Issue Type: Bug
          Components: injector
    Affects Versions: 1.21
         Environment: * Ubuntu 22 LTS
 * openjdk version "21.0.4" 2024-07-16 LTS
            Reporter: Hiran Chaudhuri


I cloned the current master branch (commit id 
d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles cleanly to 
apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
implementation.

Crawling works to some degree; I am experimenting heavily with IMAP and the 
data I receive in Solr. Looking at the 
[IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
 page, I hoped to get better information by adding all the plugins mentioned 
there.

I therefore changed the `plugin.includes` property in nutch-site.xml to 
include them all. As soon as `tld` is in the list, the injector dies while 
seeding my CrawlDb with:

{{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] Injecting 
seed URL file file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
{{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] Injector 
job did not succeed, job id: job_local1500911141_0001, job status: FAILED, 
reason: NA}}
{{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] 
Injector: java.lang.RuntimeException: Injector job did not succeed, job id: 
job_local1500911141_0001, job status: FAILED, reason: NA}}
{{    at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
{{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
{{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
{{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}

 

The failure goes away as soon as I remove `tld` from the `plugin.includes` 
property.
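
For reference, the workaround looks roughly like the fragment below. This is only a sketch: the plugin list is an illustrative assumption loosely based on the stock defaults plus my protocol-imap plugin, not my exact configuration. The only point is that `tld` is absent from the value.

```xml
<!-- nutch-site.xml (sketch): the plugin list below is an assumed example,
     not the exact configuration from this report -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- `tld` is intentionally left out of the regular expression;
         adding it back (e.g. appending "|tld") triggers the injector failure -->
    <value>protocol-http|protocol-imap|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so the alternatives are separated with `|`.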

 
 * Could the injector report a more helpful error message than "reason: NA"?
 * Why does the tld plugin crash the injector phase at all?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
