[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

ASF GitHub Bot (JIRA) Thu, 28 Sep 2017 12:48:39 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184721#comment-16184721
 ]


ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------

sebastian-nagel commented on issue #221: NUTCH-2375 Upgrading nutch to use 
org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#issuecomment-332941883
 
 
   Hi @Omkar20895,
   
   Running a test crawl on a Hadoop cluster failed again. I got two 
ClassNotFoundExceptions. The first is in Generator, in the mapper of the last 
step, "partitioning":
   
   ```
   17/09/28 17:38:16 INFO crawl.Generator: Generator: Partitioning selected 
urls for politeness.
   
   ...
   
   17/09/28 17:54:56 INFO mapreduce.Job: Task Id : 
attempt_1505293155476_0250_m_000000_98, Status : FAILED
   Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
           at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
           at 
org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
           at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
           at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
           at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
           at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
   Caused by: java.lang.ClassNotFoundException: Class 
org.apache.nutch.crawl.Generator$SelectorInverseMapper not found
           at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
           at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
           ... 8 more
   
   17/09/28 17:55:07 INFO mapreduce.Job:  map 100% reduce 100%
   17/09/28 17:55:07 INFO mapreduce.Job: Job job_1505293155476_0250 failed with 
state FAILED due to: Task failed task_1505293155476_0250_m_000000
   Job failed as tasks failed. failedMaps:1 failedReduces:0
   
   17/09/28 17:55:07 INFO mapreduce.Job: Counters: 9
           Job Counters 
                   Failed map tasks=100
                   Killed reduce tasks=2
                   Launched map tasks=100
                   Other local map tasks=100
                   Total time spent by all maps in occupied slots (ms)=2073900
                   Total time spent by all reduces in occupied slots (ms)=0
                   Total time spent by all map tasks (ms)=691300
                   Total vcore-milliseconds taken by all map tasks=691300
                   Total megabyte-milliseconds taken by all map tasks=2123673600
   17/09/28 17:55:07 INFO crawl.Generator: Generator: finished at 2017-09-28 
17:55:07, elapsed: 00:28:26
   ```
   
   Second, in the mapper of the updatedb tool:
   
   ```
   17/09/28 18:14:22 INFO crawl.CrawlDb: CrawlDb update: starting at 2017-09-28 
18:14:22
   
   ...
   
   17/09/28 18:29:06 INFO mapreduce.Job: Task Id : 
attempt_1505293155476_0253_m_000003_98, Status : FAILED
   Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
org.apache.nutch.crawl.CrawlDbFilter not found
           at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2203)
           at 
org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
           at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
           at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
           at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
           at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
   Caused by: java.lang.ClassNotFoundException: Class 
org.apache.nutch.crawl.CrawlDbFilter not found
           at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2109)
           at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2201)
           ... 8 more
   
   17/09/28 18:29:07 INFO mapreduce.Job:  map 33% reduce 100%
   17/09/28 18:29:08 INFO mapreduce.Job:  map 100% reduce 100%
   17/09/28 18:29:08 INFO mapreduce.Job: Job job_1505293155476_0253 failed with 
state FAILED due to: Task failed task_1505293155476_0253_m_000001
   Job failed as tasks failed. failedMaps:1 failedReduces:0
   
   ...
   
   17/09/28 18:29:08 INFO crawl.CrawlDb: CrawlDb update: finished at 2017-09-28 
18:29:08, elapsed: 00:14:45
   ```
   
   I have no clue why: the classes are in the job file. It could be caused by 
an incompatibility when running on Cloudera CDH 5.12.1 (Hadoop 2.6). I'll 
try to investigate this problem.
   
   Meanwhile, please have a look at another issue this uncovered: although a 
MapReduce job failed (note: a MapReduce job can be just one of multiple steps), 
the generate or updatedb "job" (here: one run of a tool) signaled "success" and 
the crawl script just continued as if there wasn't any problem. Please always 
check the return value of job.waitForCompletion(...) and, if it returns false:
   - perform the necessary cleanups: delete temporary data, etc.
   - make the main routine return 1
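   
   The two points above can be sketched roughly as follows. This is only an 
illustration, not code from the PR: `StepJob`, `runStep` and the `cleanup` 
callback are hypothetical names, and `StepJob` stands in for 
org.apache.hadoop.mapreduce.Job so that the snippet compiles without Hadoop on 
the classpath. The real Job.waitForCompletion(boolean) also returns a boolean 
and declares IOException, InterruptedException and ClassNotFoundException.
   
   ```java
   import java.util.concurrent.atomic.AtomicBoolean;

   /** Sketch only: StepJob is a hypothetical stand-in for org.apache.hadoop.mapreduce.Job. */
   class JobExitStatusSketch {

     /** Stand-in exposing only the one call discussed above. */
     interface StepJob {
       boolean waitForCompletion(boolean verbose) throws Exception;
     }

     /**
      * Runs one MapReduce step of a tool.
      * Returns 0 if the job succeeded. Otherwise runs the cleanup
      * (e.g. deleting temporary data) and returns 1.
      */
     static int runStep(StepJob job, Runnable cleanup) {
       boolean success;
       try {
         success = job.waitForCompletion(true);
       } catch (Exception e) {
         // Assumption of this sketch: an exception counts as a failed job.
         success = false;
       }
       if (!success) {
         cleanup.run();
         return 1;
       }
       return 0;
     }

     public static void main(String[] args) {
       AtomicBoolean cleaned = new AtomicBoolean(false);
       // Simulate a failed job: waitForCompletion returns false.
       int status = runStep(verbose -> false, () -> cleaned.set(true));
       System.out.println("status=" + status + " cleaned=" + cleaned.get());
     }
   }
   ```
   
   In a tool with several MapReduce steps, each step would need this check, and 
the tool's main routine would pass the non-zero value on so that the crawl 
script can stop.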
   
   I can only second Lewis: please, try to run tests independently in local and 
pseudo-distributed mode. One iteration (commit/PR, test, analyze and report 
error) takes too long otherwise. Thanks!
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 


> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch is still using the org.apache.hadoop.mapred dependency, which has been 
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce 
> dependency. 



