Hi,
I initially thought this was a POI issue, so I first posted my question on
the POI-user list.
I now see that it happens in the Nutch classes for the MS parse plugin, not
in POI, so I'm trying this list.
Here's the trace I get when I catch the exception thrown as I call
MSExcelParser's getParse(Content). It appears to be an NPE in
MSBaseParser.getParse().
[#|2006-10-04T09:13:15.102+0200|WARNING|sun-appserver-ee9.1|javax.enterprise.system.stream.err|_ThreadID=16;_ThreadName=httpWorkerThread-8080-1;_RequestID=0b18e2ae-0f79-4241-9e29-a322c8ae2bc6;|
java.lang.NullPointerException
at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:94)
at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:40)
at <my_package>.DocumentParser.parseDocument(DocumentParser.java:154)
...
The source (MSBaseParser.java) around this line reads:
****SNIP****
    extractor.extract(new ByteArrayInputStream(raw));
    text = extractor.getText();
    properties = extractor.getProperties();
    outlinks = OutlinkExtractor.getOutlinks(text, content.getUrl(), getConf());
} catch (Exception e) {
    return new ParseStatus(ParseStatus.FAILED,
            "Can't be handled as micrsosoft document. " + e)
            .getEmptyParse(this.conf);
}
// collect meta data
Metadata metadata = new Metadata();
title = properties.getProperty(DublinCore.TITLE);  // <== line 94 as indicated in the trace
properties.remove(DublinCore.TITLE);
****SNIP****
So I can only conclude that the properties object is null at that point. As
the snippet from the MSBaseParser source shows, properties starts out null
and is then assigned from the ExcelExtractor (properties =
extractor.getProperties();), which I assume is returning null.
Any ideas on how I can work around this, or am I failing to set some
required properties?
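For illustration, the kind of null guard that would avoid the NPE at line 94
can be sketched as follows. This is only a sketch under assumptions: it
assumes properties is a java.util.Properties, the class NullSafeProperties
and its methods are hypothetical names that are not part of Nutch or POI,
and the literal "title" key stands in for DublinCore.TITLE.

```java
import java.util.Properties;

/**
 * Hypothetical helper (not part of Nutch): treats a null Properties
 * object as empty so that metadata lookups cannot throw an NPE.
 */
class NullSafeProperties {

    /** Returns the given properties, or a fresh empty instance if it is null. */
    public static Properties orEmpty(Properties properties) {
        return properties != null ? properties : new Properties();
    }

    /**
     * Reads a key and then removes it, mirroring the getProperty/remove
     * pair in the snippet above. Returns the fallback when the properties
     * object is null or the key is absent.
     */
    public static String take(Properties properties, String key, String fallback) {
        Properties safe = orEmpty(properties);
        String value = safe.getProperty(key, fallback);
        safe.remove(key);
        return value;
    }

    public static void main(String[] args) {
        // Assumption: "title" stands in for DublinCore.TITLE.
        String key = "title";

        // Case 1: the extractor handed back null properties.
        System.out.println("null case: [" + take(null, key, "") + "]");

        // Case 2: the extractor handed back populated properties.
        Properties props = new Properties();
        props.setProperty(key, "Quarterly figures");
        System.out.println("normal case: [" + take(props, key, "") + "]");
    }
}
```

A guard like this only hides the symptom. It does not explain why the
extractor yields no properties for these particular documents.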
By the way, I noticed a spelling mistake in the message of the ParseStatus
returned in the code above: "micrsosoft".
Thanks,
Trym