Jukka Zitting
Thu, 12 Nov 2009 03:34:45 -0800
Hi,

On Thu, Nov 12, 2009 at 4:42 AM, Mark Kerzner <markkerz...@gmail.com> wrote:
> I tried to extract text from an Office 2007 Word and Excel, and Tika thinks
> they are XML files. "file" command in Linux thinks they are "zip" files.
> Where should I look for the current format list? What are the plans for
> Office 2007?

Tika has supported the Office 2007 formats since the 0.3 release. Note, however, that autodetection (magic byte patterns) for Office 2007 has only been in svn trunk for a few months, so what you're seeing is probably the document being autodetected as a zip file, with the contained content then being parsed as XML.

You may want to check the latest svn trunk (or the upcoming 0.5 release) to see if your problem has already been solved. If not, please file a bug report in the issue tracker.

BR,

Jukka Zitting
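
PS. To illustrate why both "file" and an older detector see a plain zip: an Office 2007 document is a zip package, so its first bytes are the ordinary zip magic, and you only learn more by looking at the entry names inside. The following is a rough sketch using only java.util.zip, not how Tika's detector is actually implemented. The class name OoxmlSniffer, its methods, and the returned labels are made up for this example, and the word/, xl/ and ppt/ prefixes are the usual layout rather than something every package is guaranteed to follow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of Tika.
public class OoxmlSniffer {

    /** True if the data starts with the zip local file header magic "PK\003\004". */
    public static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    /** Guess the document kind from the zip entry names (assumed conventional layout). */
    public static String guessKind(byte[] data) throws IOException {
        if (!looksLikeZip(data)) {
            return "not a zip";
        }
        boolean hasContentTypes = false;
        String kind = null;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;   // marker part of an Office 2007 package
                } else if (name.startsWith("word/")) {
                    kind = "Word 2007 (docx)";
                } else if (name.startsWith("xl/")) {
                    kind = "Excel 2007 (xlsx)";
                } else if (name.startsWith("ppt/")) {
                    kind = "PowerPoint 2007 (pptx)";
                }
            }
        }
        if (!hasContentTypes) {
            return "plain zip";
        }
        return kind != null ? kind : "other Office 2007 style package";
    }

    /** Build a small in-memory zip with the given entry names, for demonstration. */
    public static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes("UTF-8"));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(guessKind(zipOf("[Content_Types].xml", "word/document.xml")));
        System.out.println(guessKind(zipOf("readme.txt")));
    }
}
```

A detector that stops at the first four bytes can only ever report "zip", which matches the behaviour described above; the extra step of inspecting the package contents is what distinguishes a Word or Excel file from any other archive.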