[ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891230#comment-17891230
 ] 

Sebastian Nagel commented on NUTCH-3075:
----------------------------------------

Hi [~hiranchaudhuri], thanks for figuring out the reason for the error!

The error is my fault; it relates to NUTCH-1942. A fix/PR is ready.

I'm not sure about catching exceptions in the setup methods:
- of course, logging them helps to understand the error more quickly in local 
mode
- but there are about 50 job, mapper and reducer implementations, all 
implementing/overriding the setup method. Do we want to change them all? It is 
more a matter of clearly documenting that, in case of errors, the hadoop.log 
(or the task logs if running in distributed mode) needs to be consulted.
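
To make the trade-off concrete, here is a minimal sketch of what "catching 
exceptions in the setup methods" could look like if it were done through one 
shared helper instead of a try/catch in every implementation. The names 
SetupLogging, SetupBody and runLogged are hypothetical and exist neither in 
Nutch nor in Hadoop. SetupBody only stands in for the body of a setup method, 
and java.util.logging is used purely to keep the example self-contained.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class SetupLogging {

  private static final Logger LOG =
      Logger.getLogger(SetupLogging.class.getName());

  /** Hypothetical stand-in for the body of a mapper/reducer setup method. */
  @FunctionalInterface
  public interface SetupBody {
    void run() throws IOException, InterruptedException;
  }

  /**
   * Runs the setup body, logs any failure together with its stack trace,
   * and rethrows the same exception so that the task still fails.
   */
  public static void runLogged(String taskName, SetupBody body)
      throws IOException, InterruptedException {
    try {
      body.run();
    } catch (IOException | InterruptedException | RuntimeException e) {
      LOG.log(Level.SEVERE, "Setup of " + taskName + " failed", e);
      throw e;
    }
  }

  public static void main(String[] args) throws Exception {
    try {
      // Simulated failure, e.g. a plugin that cannot be initialized.
      runLogged("ExampleMapper", () -> {
        throw new IllegalStateException("simulated plugin failure");
      });
    } catch (IllegalStateException e) {
      System.out.println("rethrown: " + e.getMessage());
    }
  }
}
```

Even with such a helper, every setup method would still have to be touched 
once in order to call it, so the question of whether to change all of them 
remains.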

> tld plugin makes injector crash
> -------------------------------
>
>                 Key: NUTCH-3075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3075
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.21
>         Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles cleanly to 
> apache-nutch-1.21-SNAPSHOT.job,
> even after I added my own protocol-imap implementation. Crawling works to 
> some degree - I am heavily experimenting with IMAP and the data I receive in 
> Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
>  I hoped to get better information by adding all the mentioned plugins.
> Thus I reconfigured nutch-site.xml, especially the `plugin.includes` property 
> to include them all. As soon as `tld` is contained, upon seeding my CrawlDb 
> the injector dies with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] 
> Injecting seed URL file 
> file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
> {{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector job did not succeed, job id: job_local1500911141_0001, job status: 
> FAILED, reason: NA}}
> {{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.lang.RuntimeException: Injector job did not succeed, job id: 
> job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
>  
> The behaviour can be cured by simply removing `tld` from the property.
>  
>  * Could there be some better error message?
>  * Why does the tld plugin crash the injector phase at all?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
