[ https://issues.apache.org/jira/browse/TIKA-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228990#comment-14228990 ]
Nick Burch commented on TIKA-1490:
----------------------------------
As of r1642497, there is now a basic parser in Apache POI, which we can call
out to once we upgrade to 3.11 final.

The code as it stands doesn't handle format strings (so $100.11 will come out
as something like 100.1100010101), and doesn't handle codepages for the strings
(it assumes all 8-bit strings are in the default encoding). If those prove
problematic for people, it should hopefully be fairly easy to add both, by
extending the existing POI code for them to handle the old records too.
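
To make the two gaps concrete, here is a rough sketch in plain Java. It is not
the POI code: java.text.DecimalFormat is only a stand-in for Excel format
strings (which use a different syntax), the class and method names are made up
for illustration, and ISO-8859-1 is just an example codepage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper, not part of POI or Tika.
final class OldExcelValueSketch {

    // Applies a number pattern to a raw cell value. DecimalFormat is a
    // stand-in here; real Excel format strings would need their own handling.
    static String formatNumber(double value, String pattern) {
        DecimalFormat fmt =
            new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
        return fmt.format(value);
    }

    // Decodes an 8-bit string using an explicit codepage instead of
    // whatever the platform default encoding happens to be.
    static String decode(byte[] raw, Charset codepage) {
        return new String(raw, codepage);
    }

    public static void main(String[] args) {
        // Without a format the raw value is printed as-is.
        System.out.println(100.1100010101);
        // With a currency-style pattern it is rounded to two places.
        System.out.println(formatNumber(100.1100010101, "$#,##0.00"));

        // 0xE9 is e-acute in ISO-8859-1 but is not valid UTF-8 on its own,
        // so the chosen codepage changes the decoded text.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

The second decode call shows what goes wrong when the bytes are read with the
wrong encoding: the last character does not survive.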
> Basic parser for old Excel files (eg Excel 4)
> ---------------------------------------------
>
> Key: TIKA-1490
> URL: https://issues.apache.org/jira/browse/TIKA-1490
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Nick Burch
> Assignee: Nick Burch
>
> In TIKA-1487, we added mime magic for the pre-OLE2 Excel file formats. Based
> on a reading of the OpenOffice Excel docs for those, it looks like it should
> be possible to produce a basic parser to extract key bits of info (eg
> strings) from these older file formats.
> This would likely largely be done by having a custom record iterator for the
> older formats, then passing the handful of "interesting" records to POI's
> record classes (maybe with some tweaks for the older formats) to have the
> binary data parsed, then returned by the parser.
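
As a rough illustration of the record-iterator idea in the description, the
sketch below walks a byte stream and keeps only the records a caller asks for.
It assumes the usual BIFF record layout (2-byte little-endian record id, 2-byte
little-endian length, then the body). The class and method names are
hypothetical, and the id 0x0204 in the demo is assumed to be the BIFF3/BIFF4
LABEL record and should be checked against the OpenOffice docs. Handing the
bodies on to POI's record classes is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the POI implementation.
final class OldBiffRecordSketch {

    // One raw record: its 16-bit id plus the undecoded body bytes.
    static final class RawRecord {
        final int sid;
        final byte[] data;

        RawRecord(int sid, byte[] data) {
            this.sid = sid;
            this.data = data;
        }
    }

    // Walks the stream record by record and returns only those whose id is
    // in 'wanted'. Fewer than 4 trailing bytes are ignored.
    static List<RawRecord> readInteresting(byte[] stream, Set<Integer> wanted) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        List<RawRecord> out = new ArrayList<>();
        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF; // record id, unsigned
            int len = buf.getShort() & 0xFFFF; // body length, unsigned
            if (len > buf.remaining()) {
                throw new IllegalArgumentException(
                    "truncated record 0x" + Integer.toHexString(sid));
            }
            byte[] data = new byte[len];
            buf.get(data);
            if (wanted.contains(sid)) {
                out.add(new RawRecord(sid, data));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic two-record stream, made up for this demo.
        byte[] stream = {
            0x04, 0x02, 0x02, 0x00, 'h', 'i', // id 0x0204, length 2
            0x7F, 0x7F, 0x01, 0x00, 0x00      // id 0x7F7F, length 1 (skipped)
        };
        for (RawRecord r : readInteresting(stream, Collections.singleton(0x0204))) {
            System.out.println("0x" + Integer.toHexString(r.sid)
                + " with " + r.data.length + " body bytes");
        }
    }
}
```

Filtering by id up front keeps the walk cheap, since uninteresting records are
skipped without any decoding.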