I'm setting up Nutch by following various tutorials, and I just tried
to separate fetching from parsing.
Unfortunately I got a confusing ArrayIndexOutOfBoundsException when
trying to parse. I couldn't figure out what it was complaining about
(line 96 of ParseSegment.java).
Adding this try/catch block helped me a bit, but still didn't
clear things up.
Index: ParseSegment.java
===================================================================
--- ParseSegment.java (revision 953602)
+++ ParseSegment.java (working copy)
@@ -92,9 +92,13 @@
Text url = entry.getKey();
Parse parse = entry.getValue();
ParseStatus parseStatus = parse.getData().getStatus();
-
+
+ try {
reporter.incrCounter("ParserStatus",
ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);
-
+ } catch (ArrayIndexOutOfBoundsException e) {
+ LOG.error("Ununsual ParserStatus - possibly misconfiguration : " + parseStatus.getMajorCode() );
+ }
+
It shouldn't abort parsing the whole "part" just because it can't
parse one type of file.
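An alternative to catching the exception would be to check the index
before using it. The following is only a minimal sketch: MajorCodeLookup,
MAJOR_CODES and label are hypothetical names, and the array contents are
an assumed stand-in for ParseStatus.majorCodes, not copied from the
Nutch source.

```java
class MajorCodeLookup {
    // Assumed stand-in for ParseStatus.majorCodes; the real contents may differ.
    static final String[] MAJOR_CODES = {"notparsed", "success", "failed"};

    // Returns the label for a major code, or a placeholder string
    // when the code falls outside the bounds of the array.
    static String label(int majorCode) {
        if (majorCode >= 0 && majorCode < MAJOR_CODES.length) {
            return MAJOR_CODES[majorCode];
        }
        return "UNKNOWN(" + majorCode + ")";
    }

    public static void main(String[] args) {
        System.out.println(label(1));    // in range
        System.out.println(label(-56));  // out of range, no exception thrown
    }
}
```

With something like this the counter could still be incremented under a
placeholder name, so odd status codes would show up in the job counters
instead of only in the log.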
Now majorCodes looks like an enumeration, but it is actually an array
indexed by the major code. According to the log below, the unusual
major code is -56, and I can't see where that comes from. For me it
seems to affect files with the MIME types application/atom+xml and
application/rss+xml.
2010-06-13 14:57:09,089 WARN parse.ParserFactory -
ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to
contentType application/xhtml+xml via parse-plugins.xml, but its
plugin.xml file does not claim to support contentType:
application/xhtml+xml
2010-06-13 14:57:09,286 INFO crawl.SignatureFactory - Using Signature
impl: org.apache.nutch.crawl.MD5Signature
2010-06-13 14:57:10,932 INFO parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.Parser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the
plugin.includes system property, and all claim to support the content
type text/html, but they are not mapped to it in the
parse-plugins.xml file
2010-06-13 14:57:11,404 INFO parse.ParserFactory - The parsing
plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the
plugin.includes system property, and all claim to support the content
type application/atom+xml, but they are not mapped to it in the
parse-plugins.xml file
2010-06-13 14:57:11,405 ERROR tika.TikaParser - Can't retrieve Tika
parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 ERROR parse.Parser - Ununsual ParserStatus -
possibly misconfiguration : -56
2010-06-13 14:57:11,405 WARN parse.Parser - Error parsing:
http://www.mytestsite.com/blog/atom.xml: UNKNOWN!(-56,0): Can't
retrieve Tika parser for mime-type application/atom+xml
2010-06-13 14:57:11,405 WARN parse.ParserFactory - ParserFactory:
Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType
application/rss+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml
Presumably my configuration is wrong and I can't parse those MIME
types. Should I care? I don't currently care about XML.
I am using the code packaged as the 1.1 release candidate, but I think
that trivial try/catch should be added to ParseSegment.java anyway.
Does anyone know how a ParseStatus got a major code of -56, and whether
that should be possible?
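One possible clue, offered as arithmetic only: -56 is the value Java
produces when the int 200 is narrowed to a byte. The sketch below
assumes the major code passes through a byte somewhere along the way,
which I have not verified in the Nutch source; ByteNarrowing and narrow
are hypothetical names used just for illustration.

```java
class ByteNarrowing {
    // Narrows an int to a byte exactly as an explicit (byte) cast does:
    // only the low 8 bits are kept and reinterpreted as a signed value.
    static byte narrow(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(narrow(200));          // prints -56
        System.out.println(narrow(200) & 0xFF);   // prints 200, the unsigned view
    }
}
```

Nothing in the log confirms that a 200 was involved, so this may be a
coincidence.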
Thanks
Alex