[
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184721#comment-16184721
]
ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------
sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use
org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#issuecomment-332941883
Hi @Omkar20895,
running a test crawl on a Hadoop cluster failed again. I've got two
ClassNotFoundException-s, first in Generator, in the mapper of the last step
"partitioning":
```
17/09/28 17:38:16 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
...
17/09/28 17:54:56 INFO mapreduce.Job: Task Id : attempt_1505293155476_0250_m_000000_98, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
    ... 8 more
17/09/28 17:55:07 INFO mapreduce.Job: map 100% reduce 100%
17/09/28 17:55:07 INFO mapreduce.Job: Job job_1505293155476_0250 failed with state FAILED due to: Task failed task_1505293155476_0250_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
17/09/28 17:55:07 INFO mapreduce.Job: Counters: 9
    Job Counters
        Failed map tasks=100
        Killed reduce tasks=2
        Launched map tasks=100
        Other local map tasks=100
        Total time spent by all maps in occupied slots (ms)=2073900
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=691300
        Total vcore-milliseconds taken by all map tasks=691300
        Total megabyte-milliseconds taken by all map tasks=2123673600
17/09/28 17:55:07 INFO crawl.Generator: Generator: finished at 2017-09-28 17:55:07, elapsed: 00:28:26
```
Second, in the mapper of the updatedb tool:
```
17/09/28 18:14:22 INFO crawl.CrawlDb: CrawlDb update: starting at 2017-09-28 18:14:22
...
17/09/28 18:29:06 INFO mapreduce.Job: Task Id : attempt_1505293155476_0253_m_000003_98, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.CrawlDbFilter not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.CrawlDbFilter not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
    ... 8 more
17/09/28 18:29:07 INFO mapreduce.Job: map 33% reduce 100%
17/09/28 18:29:08 INFO mapreduce.Job: map 100% reduce 100%
17/09/28 18:29:08 INFO mapreduce.Job: Job job_1505293155476_0253 failed with state FAILED due to: Task failed task_1505293155476_0253_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
...
17/09/28 18:29:08 INFO crawl.CrawlDb: CrawlDb update: finished at 2017-09-28 18:29:08, elapsed: 00:14:45
```
I have no clue why; the classes are in the job file. It could be due to
some incompatibility when running on Cloudera CDH 5.12.1 (Hadoop 2.6). I'll
try to investigate this problem.
Meanwhile, please have a look at another issue this uncovered: although a
MapReduce job failed (note: a MapReduce job can be just one of multiple steps),
the generate or updatedb "job" (here: one run of a tool) signaled "success" and
the crawl script just continued as if there were no problem. Please always
check the return value of job.waitForCompletion(...) and, if it returns false:
- perform the necessary cleanups: delete temporary data, etc.
- make the main routine return 1
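The two points above could look roughly like the following. This is only a minimal sketch, not code from the PR: `CompletableJob` is a hypothetical stand-in for `org.apache.hadoop.mapreduce.Job` so that the example compiles without Hadoop on the classpath, `runStep` and `deleteRecursively` are made-up helper names, and the temporary data is assumed to be a local directory (a real tool would remove it through the Hadoop FileSystem API instead).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class JobStatusCheck {

  /** Hypothetical stand-in for org.apache.hadoop.mapreduce.Job (assumption). */
  @FunctionalInterface
  public interface CompletableJob {
    boolean waitForCompletion(boolean verbose) throws Exception;
  }

  /**
   * Runs one step of a tool. Returns 0 on success; on failure removes the
   * temporary directory and returns 1 so that the caller can stop.
   */
  public static int runStep(CompletableJob job, Path tempDir) {
    boolean success;
    try {
      success = job.waitForCompletion(true);
    } catch (Exception e) {
      // Treat an exception the same way as a job that reports failure.
      System.err.println("Job threw an exception: " + e);
      success = false;
    }
    if (!success) {
      System.err.println("Job failed, removing temporary data in " + tempDir);
      deleteRecursively(tempDir);
      return 1;
    }
    return 0;
  }

  /** Deletes a directory tree, children first; logs instead of throwing. */
  private static void deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> paths = Files.walk(dir)) {
      // Reverse order visits files before the directories containing them.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          System.err.println("Could not delete " + p + ": " + e);
        }
      });
    } catch (IOException e) {
      System.err.println("Could not walk " + dir + ": " + e);
    }
  }
}
```

The tool's main routine would then pass the value returned by `runStep` on as its exit code, so that a calling script sees a non-zero status and stops instead of continuing with the next step.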
I can only second Lewis: please try to run tests independently in local and
pseudo-distributed mode. Otherwise, one iteration (commit/PR, test, analyze and
report error) takes too long. Thanks!
> Upgrade the code base from org.apache.hadoop.mapred to
> org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
> Issue Type: Improvement
> Components: deployment
> Reporter: Omkar Reddy
>
> Nutch is still using the org.apache.hadoop.mapred dependency, which has been
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce
> dependency.