On Tue, 5 Feb 2013, saisantoshi wrote:
> I am looking at the formats supported by the newer version of Tika (1.3)
> and am not sure which version(s) of Microsoft Office it supports
> (97/2000/2010/2013) for each of the formats below:
> http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats

That section you reference should tell you all you need to know!
As stated there, the OLE2 formats (.doc, .xls and .ppt) as written by
Office 97 and later are supported (but not the formats of older versions
such as Office 95), and the newer OOXML-based formats (.docx, .xlsx, .pptx)
introduced with Office 2007 (and used by later versions) are also supported.
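
If you just want to see which of those two families a given file belongs
to, the leading bytes are enough. This is a rough sketch of my own, not
Tika's detection code: the class name OfficeContainerSniffer and the method
sniff are made up for illustration. It relies only on the well-known OLE2
compound-document signature and the ZIP local-file-header signature (OOXML
files are ZIP packages). Note that it identifies the container only: a ZIP
is not necessarily an Office file, and an OLE2 file may still be an older,
unsupported Office version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Tika).
class OfficeContainerSniffer {
    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header; OOXML files (.docx, .xlsx, .pptx) are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies a file by its first bytes as "OLE2", "ZIP" or "UNKNOWN".
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return "OLE2";
        if (startsWith(header, ZIP_MAGIC)) return "ZIP";
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java OfficeContainerSniffer <file>");
            return;
        }
        byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read up to 8 bytes; a single read() call may return fewer.
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
        }
        System.out.println(sniff(Arrays.copyOf(header, total)));
    }
}
```

Run it as "java OfficeContainerSniffer some-file.doc". For anything beyond
a quick check, let Tika do the detection and parsing itself.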

The parsers pull out all the common text, along with a fair amount of
formatting. It's possible that you may find a kind of text that they don't
currently extract (maybe it's in some obscure new area of the file used
only by the most recent Office version, or maybe it's in something old but
uncommonly used). In that case you'd need to open a new issue in JIRA,
upload a small file that shows the problem, and ideally also upload a small
failing unit test.

Nick