[
https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276053#comment-14276053
]
Nick Burch commented on TIKA-1515:
----------------------------------
Hopefully fixed in Apache POI in r1651517. It seems BIFF2 and BIFF3 use codepage
values above 32767, which wrap around to negative numbers when read as a signed
short, and we weren't handling that.
Any chance you could grab a nightly / svn build and see if that behaves nicely?
If so, we can try to roll a new POI beta fairly soon for the fix.
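
For anyone following along, here is a minimal sketch of the wraparound itself. It is not the POI patch; the class and method names are made up for illustration. It assumes the codepage is stored as a little-endian 16-bit value with bytes 0x01 0x80 (that is, 0x8001), chosen because it reproduces the -32767 seen in the stack trace below when read as signed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of POI or Tika.
public class CodepageWraparound {

    /** Reads a little-endian 16-bit value as a signed short (the problematic reading). */
    static int readSigned(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort();
    }

    /** Reads the same two bytes, then masks to recover the unsigned 0..65535 value. */
    static int readUnsigned(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        // Assumed on-disk bytes for a codepage value of 0x8001.
        byte[] raw = {0x01, (byte) 0x80};
        System.out.println(readSigned(raw));   // prints -32767
        System.out.println(readUnsigned(raw)); // prints 32769
    }
}
```

Masking with 0xFFFF is just one way to get the unsigned value back; the actual change in r1651517 may well handle it differently.
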
> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
> Key: TIKA-1515
> URL: https://issues.apache.org/jira/browse/TIKA-1515
> Project: Tika
> Issue Type: Bug
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and
> excel.sheet.3, and we have parsing for excel.sheet.4. It looks like there
> are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in
> govdocs1.
> The predominant issue (169 out of 175) appears to stem from a bad/missing
> code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
>     at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
>     at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
>     at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
>     at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
>     at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
>     ... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
>     at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
>     at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
>     at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
>     at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
>     at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
>     ... 46 more
> {noformat}
> The second issue only affects 4 documents.