[ 
https://issues.apache.org/jira/browse/NUTCH-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891232#comment-17891232
 ] 

Hiran Chaudhuri edited comment on NUTCH-3075 at 10/20/24 9:17 AM:
------------------------------------------------------------------

{quote}in case of errors the hadoop.log (or the task logs if running in 
distributed mode) needs to be consulted.
{quote}
Yes, that's what I tried to point out. The hint is generic enough that it can 
be printed whenever the job fails, so even those who skip the documentation 
would see it.

Looking at

[http://apache.github.io/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Task_Logs]

it seems the task logs are in {{${HADOOP_LOG_DIR}/userlogs}}. With that the 
error message could say

{{Also check the logs in hadoop.log or ${HADOOP_LOG_DIR}/userlogs.}}

The variable should be resolved automatically so users do not have to 
guess.
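
A minimal sketch of how such a hint could be built, purely for illustration. The class {{LogHint}} and the method {{buildHint}} are hypothetical names and not part of Nutch or Hadoop. It also assumes that {{HADOOP_LOG_DIR}} is visible in the environment of the JVM that runs the job, which may not hold for every setup. If the variable is unset, the sketch falls back to printing the unresolved placeholder.

```java
import java.util.Map;

/** Hypothetical helper (not part of Nutch): builds a "where to look" hint for failed jobs. */
final class LogHint {

  private LogHint() {}

  /**
   * Builds the hint from an environment map, so the logic can be exercised
   * without depending on the real process environment.
   */
  static String buildHint(Map<String, String> env) {
    String logDir = env.get("HADOOP_LOG_DIR");
    if (logDir == null || logDir.isEmpty()) {
      // Assumption: when the variable is not set, show the placeholder as-is
      // rather than guessing a directory.
      return "Also check the logs in hadoop.log or ${HADOOP_LOG_DIR}/userlogs.";
    }
    // Drop one trailing slash so the resolved path does not contain "//".
    if (logDir.length() > 1 && logDir.endsWith("/")) {
      logDir = logDir.substring(0, logDir.length() - 1);
    }
    return "Also check the logs in hadoop.log or " + logDir + "/userlogs.";
  }
}
```

A caller could then log {{LogHint.buildHint(System.getenv())}} right next to the existing "job did not succeed" error.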


> tld plugin makes injector crash
> -------------------------------
>
>                 Key: NUTCH-3075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3075
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.21
>         Environment: * Ubuntu 22 LTS
>  * openjdk version "21.0.4" 2024-07-16 LTS
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>             Fix For: 1.21
>
>
> I cloned the current master branch (commit id 
> d6f55b8ea6f5809cef5a31239e5760be23742c00), which compiles nicely to 
> apache-nutch-1.21-SNAPSHOT.job, even after I added my own protocol-imap 
> implementation. Crawling works to some degree; I am experimenting heavily 
> with IMAP and the data I receive in Solr. Looking at the 
> [IndexStructure|https://cwiki.apache.org/confluence/display/NUTCH/IndexStructure]
> page, I hoped to get better information by adding all the plugins mentioned 
> there. I therefore reconfigured nutch-site.xml, in particular the 
> `plugin.includes` property, to include them all. As soon as `tld` is 
> included, the injector dies while seeding my CrawlDb with
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{WARN StatusConsoleListener The use of package scanning to locate plugins is 
> deprecated and will be removed in a future release}}
> {{2024-10-11 23:27:51,295 INFO org.apache.nutch.plugin.PluginManifestParser 
> [main] Plugins: looking in: 
> /home/hiran/NetBeansProjects/nutch/runtime/local/plugins}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Plugin Auto-activation mode: [true]}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Plugins:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     the nutch core extension points (nutch-extensionpoints)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     Top Level Domain Plugin (tld)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]     IMAP Protocol Plug-in (protocol-imap)}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main] Registered Extension-Points:}}
> {{2024-10-11 23:27:51,366 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Content Parser)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (HTML Parse Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Scoring)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Normalizer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Publisher)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Exchange)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Protocol)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch URL Ignore Exemption Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Index Writer)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Segment Merge Filter)}}
> {{2024-10-11 23:27:51,367 INFO org.apache.nutch.plugin.PluginRepository 
> [main]      (Nutch Indexing Filter)}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: starting}}
> {{2024-10-11 23:27:51,368 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: crawlDb: crawl/crawldb}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: urlDir: urls}}
> {{2024-10-11 23:27:51,369 INFO org.apache.nutch.crawl.Injector [main] 
> Injector: Converting injected urls to crawl db entries.}}
> {{2024-10-11 23:27:51,519 INFO org.apache.nutch.crawl.Injector [main] 
> Injecting seed URL file 
> file:/home/hiran/NetBeansProjects/nutch/urls/seed.txt}}
> {{2024-10-11 23:27:52,778 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector job did not succeed, job id: job_local1500911141_0001, job status: 
> FAILED, reason: NA}}
> {{2024-10-11 23:27:52,779 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.lang.RuntimeException: Injector job did not succeed, job id: 
> job_local1500911141_0001, job status: FAILED, reason: NA}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:446)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
>  
> The behaviour can be cured by simply removing `tld` from the property.
>  
>  * Could there be some better error message?
>  * Why does the tld plugin crash the injector phase at all?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
