[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606592#comment-14606592
]
Asitang Mishra edited comment on NUTCH-2038 at 6/30/15 5:43 PM:
----------------------------------------------------------------
Hi [~wastl-nagel] and [~lewismc],
Please take a look at the latest patch and help me figure out the exception
below. I hit it when running in local mode (please test the latest pull for
this). I also hit it with pull #40. Please test and see whether you get it too.
I have added all the dependencies, so I don't understand why it still throws
a ClassNotFoundException:
java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:340)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
    ... 9 more
2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed!
    at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
    at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:35)
    at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
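
The trace shows the lookup going through the JVM application class loader
(sun.misc.Launcher$AppClassLoader), so one thing that may be worth checking is
which class loader can actually see the Mahout class. The following is only a
diagnostic sketch using plain JDK calls; `ClassVisibility` and `canLoad` are
hypothetical names and are not part of Nutch, Hadoop, or Mahout.

```java
// Hypothetical helper: reports whether a given class loader can resolve a class name.
final class ClassVisibility {

    // Returns true if `loader` can find `className`; the class is not initialized.
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper";
        // Loader used for the application classpath.
        System.out.println("system loader:  "
                + canLoad(name, ClassLoader.getSystemClassLoader()));
        // Loader attached to the current thread (may differ inside a plugin).
        System.out.println("context loader: "
                + canLoad(name, Thread.currentThread().getContextClassLoader()));
    }
}
```

Calling `canLoad` from inside the plugin with `getClass().getClassLoader()` as
the second argument would show whether the class is visible to the loader that
loaded the plugin but not to the application class loader.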
> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. First, classify
> the parse text to decide whether the parent page is relevant. If it is
> relevant, keep all of its outlinks. If it is irrelevant, go through each
> outlink and check whether the URL contains any of the important words from
> a list. If it does, let that outlink pass.
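
The two stages described above can be sketched roughly as follows. This is an
illustration only, not the code in the patch: `TwoStageOutlinkFilter`, its
fields, and the `Predicate` standing in for the Naive Bayes classifier are all
hypothetical, outlinks are represented as plain URL strings, and
case-insensitive word matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch of the two-stage outlink filtering idea.
final class TwoStageOutlinkFilter {
    private final Predicate<String> isRelevant;   // stand-in for the classifier
    private final List<String> importantWords;    // stand-in for the word list

    TwoStageOutlinkFilter(Predicate<String> isRelevant, List<String> importantWords) {
        this.isRelevant = isRelevant;
        this.importantWords = importantWords;
    }

    // Returns the outlinks that survive filtering for a page with this parse text.
    List<String> filter(String parseText, List<String> outlinks) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        // Stage 2: for an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

In a real parse filter the predicate would be replaced by the trained model's
decision, and the word list would come from configuration.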
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)