[
https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546447#comment-16546447
]
ASF GitHub Bot commented on NUTCH-2071:
---------------------------------------
sebastian-nagel closed pull request #358: NUTCH-2071 A parser failure on a
single document may fail crawling job if parser.timeout=-1
URL: https://github.com/apache/nutch/pull/358
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/src/java/org/apache/nutch/parse/ParseUtil.java b/src/java/org/apache/nutch/parse/ParseUtil.java
index fd933b468..e77c97f23 100644
--- a/src/java/org/apache/nutch/parse/ParseUtil.java
+++ b/src/java/org/apache/nutch/parse/ParseUtil.java
@@ -91,10 +91,16 @@ public ParseResult parse(Content content) throws ParseException {
         LOG.debug("Parsing [" + content.getUrl() + "] with [" + parsers[i]
             + "]");
       }
-      if (maxParseTime != -1)
+      if (maxParseTime != -1) {
         parseResult = runParser(parsers[i], content);
-      else
-        parseResult = parsers[i].getParse(content);
+      } else {
+        try {
+          parseResult = parsers[i].getParse(content);
+        } catch (Throwable e) {
+          LOG.warn("Error parsing " + content.getUrl() + " with "
+              + parsers[i].getClass().getName(), e);
+        }
+      }
       if (parseResult != null && !parseResult.isEmpty())
         return parseResult;
@@ -146,10 +152,16 @@ public ParseResult parseByExtensionId(String extId, Content content)
     }
     ParseResult parseResult = null;
-    if (maxParseTime != -1)
+    if (maxParseTime != -1) {
       parseResult = runParser(p, content);
-    else
-      parseResult = p.getParse(content);
+    } else {
+      try {
+        parseResult = p.getParse(content);
+      } catch (Throwable e) {
+        LOG.warn("Error parsing " + content.getUrl() + " with "
+            + p.getClass().getName(), e);
+      }
+    }
     if (parseResult != null && !parseResult.isEmpty()) {
       return parseResult;
     } else {
@@ -170,7 +182,8 @@ private ParseResult runParser(Parser p, Content content) {
     try {
       res = task.get(maxParseTime, TimeUnit.SECONDS);
     } catch (Exception e) {
-      LOG.warn("Error parsing " + content.getUrl() + " with " + p, e);
+      LOG.warn("Error parsing " + content.getUrl() + " with "
+          + p.getClass().getName(), e);
       task.cancel(true);
     } finally {
       pc = null;
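
The diff leaves the catch (Exception e) in runParser unchanged, while the direct getParse call now catches Throwable. One plausible reason, sketched below, is that anything thrown inside a task run through a java.util.concurrent Future, including an Error, reaches the caller of get() wrapped in an ExecutionException. The class TimedParseDemo, the method describeFailure, and the single-thread executor are hypothetical stand-ins for illustration, not Nutch code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimedParseDemo {

  /**
   * Runs a hypothetical parser task with a timeout and reports what the
   * calling thread observes when the task fails.
   */
  static String describeFailure(Callable<String> parser, long timeoutSeconds) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> task = executor.submit(parser);
    try {
      return "parsed: " + task.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (Exception e) {
      // An Error thrown inside the task arrives here as the cause of an
      // ExecutionException, so catching Exception is sufficient on this path.
      task.cancel(true);
      Throwable cause = e.getCause();
      return e.getClass().getSimpleName() + " caused by "
          + (cause == null ? "nothing" : cause.getClass().getSimpleName());
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Simulates the failure from the issue: an Error, not an Exception.
    Callable<String> broken = () -> {
      throw new IncompatibleClassChangeError("simulated linkage failure");
    };
    // prints "ExecutionException caused by IncompatibleClassChangeError"
    System.out.println(describeFailure(broken, 5));
  }
}
```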
----------------------------------------------------------------
> A parser failure on a single document may fail crawling job if
> parser.timeout=-1
> ---------------------------------------------------------------------------------
>
> Key: NUTCH-2071
> URL: https://issues.apache.org/jira/browse/NUTCH-2071
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Reporter: Arkadi Kosmynin
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.14, 1.15
>
> Attachments: NUTCH-2071.diff
>
>
> java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>     <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> Suggested fix in ParseUtil:
> Replace
>     if (maxParseTime!=-1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
> with
>     try
>     {
>       if (maxParseTime!=-1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch( Throwable e )
>     {
>       LOG.warn( "Parsing " + content.getUrl() + " with " +
>         parsers[i].getClass().getName() + " failed: " + e.getMessage() ) ;
>       parseResult = null ;
>     }
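
The stack trace in the issue ends in an IncompatibleClassChangeError, which is a java.lang.Error and so is not matched by catch (Exception e). The sketch below illustrates why the suggested fix catches Throwable, and how doing so keeps one bad document from aborting the rest of a batch. The class PerDocumentIsolationDemo, its methods, and the rule that ".class" URLs fail are all hypothetical; none of this is Nutch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerDocumentIsolationDemo {

  /** Hypothetical parser that fails with an Error on ".class" documents. */
  static String parse(String url) {
    if (url.endsWith(".class")) {
      throw new IncompatibleClassChangeError("simulated failure for " + url);
    }
    return "parsed " + url;
  }

  /** Parses every URL; a failing document is logged and skipped. */
  static List<String> parseAll(List<String> urls) {
    List<String> results = new ArrayList<>();
    for (String url : urls) {
      try {
        results.add(parse(url));
      } catch (Throwable e) { // also matches Error subclasses
        System.err.println("Error parsing " + url + ": " + e);
      }
    }
    return results;
  }

  /** Returns true if a failure on this URL gets past catch (Exception e). */
  static boolean escapesCatchException(String url) {
    try {
      try {
        parse(url);
      } catch (Exception e) {
        return false; // handled by the narrower catch
      }
    } catch (Error e) {
      return true; // the narrower catch never saw it
    }
    return false; // no failure at all
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList("a.html", "b.class", "c.html");
    System.out.println(parseAll(urls));                    // [parsed a.html, parsed c.html]
    System.out.println(escapesCatchException("b.class"));  // true
  }
}
```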