[ https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276887#comment-14276887 ]
Tim Allison commented on TIKA-1515:
-----------------------------------
Thank you, Nick! Will probably have time early next week.
> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
> Key: TIKA-1515
> URL: https://issues.apache.org/jira/browse/TIKA-1515
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have MIME type identification for
> excel.sheet.4 and excel.sheet.3, and we have parsing for excel.sheet.4.
> It looks like there are two issues with excel.sheet.3 parsing that affect
> most excel.sheet.3 files in govdocs1.
> The predominant issue (169 out of 175 files) appears to stem from a
> bad/missing code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
>     at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
>     at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
>     at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
>     at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
>     at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     ... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
>     at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
>     at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
>     at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
>     at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
>     at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
>     ... 46 more
> {noformat}
> The second issue only affects 4 documents.
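
A note on the first trace: the value -32767 suggests that the 16-bit codepage field was read as a signed short. The bytes 0x01 0x80 (little-endian 0x8001) come out as -32767 when signed and as 32769 when unsigned. The sketch below only illustrates that arithmetic. The class and method names are hypothetical and this is not POI's code. The mapping of 32769 to windows-1252 and 32768 to Mac Roman is my reading of the BIFF2/BIFF3 format documentation, not something this issue confirms.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not taken from POI or Tika.
class CodepageSketch {
    // Assumed BIFF2/BIFF3 codepage values (not confirmed in this issue).
    static final int CP_MAC_ROMAN_BIFF23 = 32768;    // 0x8000
    static final int CP_WINDOWS_1252_BIFF23 = 32769; // 0x8001

    /** Reads a little-endian 16-bit field as a signed short. */
    static int readSigned(byte[] field) {
        return ByteBuffer.wrap(field).order(ByteOrder.LITTLE_ENDIAN).getShort();
    }

    /** Reads the same field but masks off the sign extension. */
    static int readUnsigned(byte[] field) {
        return readSigned(field) & 0xFFFF;
    }

    /** Maps the assumed BIFF2/BIFF3 values to charset names; null if unknown. */
    static String encodingFor(int codepage) {
        switch (codepage) {
            case CP_WINDOWS_1252_BIFF23:
                return "windows-1252";
            case CP_MAC_ROMAN_BIFF23:
                return "x-MacRoman";
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        byte[] field = {0x01, (byte) 0x80};
        System.out.println(readSigned(field));   // -32767, as in the trace
        System.out.println(readUnsigned(field)); // 32769
        System.out.println(encodingFor(readUnsigned(field)));
    }
}
```

If that reading is right, a lookup keyed on the unsigned value would find an encoding where the signed value falls outside the valid range. Whether that matches the affected govdocs1 files would need checking against the attachment.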
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)