Tim Allison created TIKA-1515:
---------------------------------
Summary: Old XLS 3 parsing is not working
Key: TIKA-1515
URL: https://issues.apache.org/jira/browse/TIKA-1515
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
Priority: Minor
Thanks to [~gagravarr], we now have mime type identification for excel.sheet.4
and excel.sheet.3, and we have parsing for excel.sheet.4. It looks like there
are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in
govdocs1.
The predominant issue (169 out of 173) appears to stem from a bad/missing code
page parse:
{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
    at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
    at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
    at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
    at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
    at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    ... 41 more
Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
    at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
    at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
    at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
    at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
    at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
    ... 46 more
{noformat}
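A negative codepage number such as -32767 may hint that a 16-bit field is being read as a signed short. The sketch below only illustrates that arithmetic; it is not the POI code, and the class and method names (CodepageSketch, toUnsigned) are hypothetical. As far as I understand the BIFF documentation, the resulting value 32769 (0x8001) is used in old Excel files to denote Windows CP-1252, but that is an assumption and not something verified against these govdocs1 files.

```java
public class CodepageSketch {

    // Reinterpret a signed 16-bit value as an unsigned number in 0..65535.
    // Hypothetical helper for illustration; not part of POI or Tika.
    static int toUnsigned(short raw) {
        return raw & 0xFFFF;
    }

    public static void main(String[] args) {
        // The value reported in the stack trace, as a signed short would hold it.
        short signedCodepage = (short) -32767;

        // The same 16 bits (0x8001) read as unsigned.
        int unsignedCodepage = toUnsigned(signedCodepage);

        System.out.println(signedCodepage + " -> " + unsignedCodepage);
    }
}
```

If this guess is right, reading the codepage as an unsigned short before handing it to the encoding lookup would be one direction to investigate.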
The second issue only affects 4 documents.