Arkadi Kosmynin created NUTCH-2071:
--------------------------------------

             Summary: A parser failure on a single document may fail crawling job
                 Key: NUTCH-2071
                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
             Project: Nutch
          Issue Type: Bug
          Components: parser
            Reporter: Arkadi Kosmynin


java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
        <...>
Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
                at java.lang.ClassLoader.defineClass1(Native Method)
                at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
                at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
                at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
                at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
                at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
                at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
                at java.security.AccessController.doPrivileged(Native Method)
                at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
                at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
                at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
                at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)

Suggested fix in ParseUtil:

Replace 

      if (maxParseTime != -1)
        parseResult = runParser(parsers[i], content);
      else
        parseResult = parsers[i].getParse(content);

with

      try {
        if (maxParseTime != -1)
          parseResult = runParser(parsers[i], content);
        else
          parseResult = parsers[i].getParse(content);
      } catch (Throwable e) {
        LOG.warn("Parsing " + content.getUrl() + " with "
            + parsers[i].getClass().getName() + " failed: " + e.getMessage());
        parseResult = null;
      }
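
For illustration, the idea behind the suggested fix can be sketched outside of Nutch. IncompatibleClassChangeError is a java.lang.Error, not an Exception, so a handler that catches only Exception would not intercept it; catching Throwable around each parser call does. The class ParseIsolationSketch, the DocParser interface, and the method parseWithFallback below are hypothetical stand-ins invented for this sketch and are not the real ParseUtil or Parser API.

```java
import java.util.Arrays;
import java.util.List;

public class ParseIsolationSketch {

    /** Hypothetical stand-in for a parser plugin; not the Nutch Parser interface. */
    interface DocParser {
        String parse(String url);
    }

    /**
     * Tries each parser in turn. Any failure in one parser, including an Error
     * such as IncompatibleClassChangeError, is logged and treated as "no result",
     * so the remaining parsers (and the remaining documents) are still processed.
     */
    static String parseWithFallback(List<DocParser> parsers, String url) {
        for (DocParser parser : parsers) {
            String result;
            try {
                result = parser.parse(url);
            } catch (Throwable e) {
                // Assumption: swallowing every Throwable is acceptable here.
                // Note that this also covers errors such as OutOfMemoryError.
                System.err.println("Parsing " + url + " with "
                        + parser.getClass().getName() + " failed: " + e);
                result = null;
            }
            if (result != null) {
                return result;
            }
        }
        return null; // no parser produced a result
    }

    public static void main(String[] args) {
        // Simulates the linkage failure from the stack trace above.
        DocParser broken = url -> {
            throw new IncompatibleClassChangeError("simulated linkage failure");
        };
        DocParser fallback = url -> "parsed " + url;

        // The broken parser is skipped and the fallback result is printed.
        System.out.println(parseWithFallback(
                Arrays.asList(broken, fallback), "http://example.org/Foo.class"));
    }
}
```

Catching Throwable is deliberately broad. Whether a narrower catch (for example Exception plus LinkageError) would be preferable is a design question this sketch does not settle.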




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
