What are the ParseStatus major codes?

Alex McLintock Sun, 13 Jun 2010 07:13:48 -0700

I'm setting up Nutch by following various tutorials, and just tried to
separate fetching from parsing.


Unfortunately I got a confusing ArrayIndexOutOfBoundsException when
trying to parse. I couldn't figure out what it was complaining about
(line 96 of ParseSegment.java).

Adding this try/catch block helped me out a bit, but still didn't
clear things up.



Index: ParseSegment.java
===================================================================
--- ParseSegment.java   (revision 953602)
+++ ParseSegment.java   (working copy)
@@ -92,9 +92,13 @@
       Text url = entry.getKey();
       Parse parse = entry.getValue();
       ParseStatus parseStatus = parse.getData().getStatus();
-
+
+      try {
       reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);
-
+      } catch (ArrayIndexOutOfBoundsException e) {
+          LOG.error("Ununsual ParserStatus - possibly misconfiguration : " + parseStatus.getMajorCode() );
+      }
+


Parsing of the whole "part" shouldn't be aborted just because one type
of file can't be parsed.
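
An alternative to catching the exception would be to check the index
before using it. The following is only a minimal sketch of that idea,
not the Nutch code: the class MajorCodeLabels, the method label() and
the contents of the MAJOR_CODES array are all assumptions made up for
illustration, standing in for ParseStatus.majorCodes.

```java
// Sketch only: MAJOR_CODES is a hypothetical stand-in for
// ParseStatus.majorCodes; the real array contents may differ.
final class MajorCodeLabels {
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Returns the label for a major code, or a placeholder when the code
    // is outside the array bounds, instead of throwing
    // ArrayIndexOutOfBoundsException.
    static String label(int majorCode) {
        if (majorCode >= 0 && majorCode < MAJOR_CODES.length) {
            return MAJOR_CODES[majorCode];
        }
        return "unknown(" + majorCode + ")";
    }

    public static void main(String[] args) {
        System.out.println(label(1));   // prints "success"
        System.out.println(label(-56)); // prints "unknown(-56)"
    }
}
```

With a lookup like this the counter could still be incremented under a
placeholder name, so an unexpected code would show up in the job
counters rather than only in the log.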



Now majorCodes looks like an enumeration, but isn't one: it is indexed
like an array, hence the exception. According to the log below, the
unusual major code is -56. I can't see where that is coming from. For
me it seems to be a problem with files with the MIME types
application/atom+xml and application/rss+xml.



2010-06-13 14:57:09,089 WARN  parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType:
application/xhtml+xml
2010-06-13 14:57:09,286 INFO  crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2010-06-13 14:57:10,932 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/html, but they are not mapped to it in the parse-plugins.xml
file
2010-06-13 14:57:11,404 INFO  parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/atom+xml, but they are not mapped to it  in the
parse-plugins.xml file
2010-06-13 14:57:11,405 ERROR tika.TikaParser - Can't retrieve Tika
parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 ERROR parse.Parser - Ununsual ParserStatus -
possibly misconfiguration : -56
2010-06-13 14:57:11,405 WARN  parse.Parser - Error parsing:
http://www.mytestsite.com/blog/atom.xml: UNKNOWN!(-56,0): Can't
retrieve Tika parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 WARN  parse.ParserFactory - ParserFactory:
Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType
application/rss+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml


Now presumably my configuration is wrong and I can't parse those MIME
types. Should I care? I don't currently care about XML.

I am using the code packaged as the 1.1 release candidate, but I think
that trivial try/catch should go into ParseSegment.java anyway.

Does anyone know how a ParseStatus got a major code of -56, and whether
that should be possible?
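
One guess, not verified against the Nutch source: if the major code is
held in a Java byte, then a larger value such as 200 passed in as the
major code would wrap around to exactly -56. The sketch below only
demonstrates that arithmetic; ByteNarrowingDemo and narrow() are
made-up names, and whether ParseStatus really stores the code this way
is an assumption.

```java
// Sketch only: shows Java's narrowing conversion from int to byte.
// It does not use any Nutch classes.
final class ByteNarrowingDemo {
    // Narrowing keeps the low 8 bits and reinterprets them as signed,
    // so values from 128 to 255 come out as value - 256.
    static byte narrow(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(narrow(200)); // prints -56
        System.out.println(narrow(2));   // prints 2
    }
}
```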

Thanks

Alex
