[ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958660#comment-15958660 ]
Sebastian Nagel commented on NUTCH-2071:
----------------------------------------
- caused by a library/dependency conflict (NUTCH-2316)
- see also [discussion user@nutch|https://lists.apache.org/thread.html/e50099016d17b609a0db0bfcd75cbdf6ca281cf4f9b75700af8e4666@1435024795@%3Cuser.nutch.apache.org%3E]
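
One way to confirm a conflict of this kind is to check which jar a class is actually loaded from. The following is only a sketch, assuming the conflict involves two versions of the ASM library on the classpath, as the IncompatibleClassChangeError in the trace below suggests (ClassVisitor was an interface in ASM 3.x and became an abstract class in ASM 4.0). The class name ClassOrigin and its methods are made up for illustration and are not part of Nutch or Tika.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, if known. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            // JDK core classes usually have no code source
            return "(bootstrap or unknown location)";
        }
        return src.getLocation().toString();
    }

    /** Describes whether the named type is a class or an interface and where it came from. */
    public static String describe(String className) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            String kind = cls.isInterface() ? "an interface" : "a class";
            return className + " is " + kind + ", loaded from " + locationOf(cls);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " not found or not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Default to the type named in the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.objectweb.asm.ClassVisitor";
        System.out.println(describe(name));
    }
}
```

Run it with the same classpath as the failing parse job. If org.objectweb.asm.ClassVisitor is reported as an interface, that would point to an old ASM jar shadowing the version Tika was compiled against.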
> A parser failure on a single document may fail crawling job
> ------------------------------------------------------------
>
> Key: NUTCH-2071
> URL: https://issues.apache.org/jira/browse/NUTCH-2071
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Reporter: Arkadi Kosmynin
> Attachments: NUTCH-2071.diff
>
>
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> Suggested fix in ParseUtil:
> Replace
>     if (maxParseTime != -1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
> with
>     try {
>       if (maxParseTime != -1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch (Throwable e) {
>       LOG.warn("Parsing " + content.getUrl() + " with "
>           + parsers[i].getClass().getName() + " failed: " + e.getMessage());
>       parseResult = null;
>     }
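
The idea behind the suggested fix, isolating a failure to the one document being parsed, can be tried outside Nutch with a small stand-alone sketch. Everything here is hypothetical: DocParser is a stand-in interface, not the Nutch Parser API, and parseWithFallback only mimics the loop over parsers[i]. Note that the sketch catches Exception and LinkageError rather than Throwable as in the suggestion above; that is a deliberately narrower choice which still covers IncompatibleClassChangeError (a subclass of LinkageError) but lets errors such as OutOfMemoryError propagate.

```java
import java.util.ArrayList;
import java.util.List;

public class IsolatedParse {

    /** Hypothetical stand-in for a parser plugin; not the real Nutch API. */
    public interface DocParser {
        String parse(String content) throws Exception;
    }

    /**
     * Tries each parser in turn. A failing parser is recorded as a warning
     * and skipped, so one bad document or plugin cannot abort the caller.
     * Returns null if no parser produced a result.
     */
    public static String parseWithFallback(List<DocParser> parsers, String url,
                                           String content, List<String> warnings) {
        for (int i = 0; i < parsers.size(); i++) {
            try {
                String result = parsers.get(i).parse(content);
                if (result != null) {
                    return result;
                }
            } catch (Exception | LinkageError e) {
                // LinkageError covers IncompatibleClassChangeError from library conflicts
                warnings.add("Parsing " + url + " with parser #" + i + " failed: " + e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<DocParser> parsers = new ArrayList<>();
        // First parser simulates the dependency conflict from the stack trace.
        parsers.add(c -> { throw new IncompatibleClassChangeError("simulated conflict"); });
        // Second parser succeeds.
        parsers.add(c -> c.toUpperCase());

        List<String> warnings = new ArrayList<>();
        String parsed = parseWithFallback(parsers, "http://example.org/doc", "hello", warnings);
        System.out.println("result: " + parsed);
        warnings.forEach(System.out::println);
    }
}
```

Whether to catch Throwable or a narrower set is a design choice: Throwable guarantees the job never dies because of a parser, at the cost of also hiding conditions the JVM may not recover from.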
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)