On Tue, 5 Feb 2013, saisantoshi wrote:
I am looking at the formats supported by the newer version of Tika (1.3) and was not sure which version(s) of Microsoft Office (97/2000/2010/2013) it supports for each of the formats below?

http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats

That section you reference should tell you all you need to know!

As stated there, the OLE2 formats (.doc, .xls and .ppt) from Office 97 onwards are supported (but not the older Office 95 etc. formats), and the newer OOXML-based formats (.xlsx, .pptx, .docx) introduced with Office 2007 (and used by later versions) are also supported.
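If you want a quick way to see which of the two families a given file belongs to, here is a rough sketch. To be clear, this is not how Tika does its detection, and the class and method names (OfficeContainerSniffer, sniff, readHeader) are made up for illustration. It only checks the well-known container signatures using the plain JDK: OLE2 files start with the bytes D0 CF 11 E0 A1 B1 1A E1, while OOXML files are ZIP packages and so normally start with "PK" 03 04. A match only tells you the container type. It can't tell Office 95 from Office 97 within OLE2, nor an OOXML document from any other ZIP file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
final class OfficeContainerSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature ("PK\003\004"), as found at the start
    // of typical OOXML packages (.docx, .xlsx, .pptx).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a file by its leading bytes.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to max bytes; returns fewer if the stream ends early.
    static byte[] readHeader(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        while (total < max) {
            int n = in.read(buf, total, max - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                System.out.println(name + ": " + sniff(readHeader(in, 8)));
            }
        }
    }
}
```

Run it with one or more file names as arguments. For real work you'd let Tika do the detection and parsing, since it also looks inside the container to work out the actual document type.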

The parsers pull out all the common text, along with a fair amount of formatting. It's possible that you may find a kind of text that they don't currently extract (maybe because it's in some obscure new area of the file used by the most recent Office version, or maybe just in something old but uncommonly used). In that case you'd need to open a new issue in JIRA, upload a small file that shows the problem, and ideally also upload a small failing unit test.

Nick
