Mark Kerzner
Thu, 12 Nov 2009 05:01:22 -0800
Thank you, Jukka, will try that.

Mark

On Thu, Nov 12, 2009 at 5:33 AM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> On Thu, Nov 12, 2009 at 4:42 AM, Mark Kerzner <markkerz...@gmail.com> wrote:
> > I tried to extract text from Office 2007 Word and Excel files, and Tika
> > thinks they are XML files. The "file" command in Linux thinks they are
> > "zip" files. Where should I look for the current format list? What are
> > the plans for Office 2007?
>
> Tika has had support for Office 2007 formats since the 0.3 release. But
> note that we've only had autodetection support (magic byte patterns)
> for Office 2007 for a few months in svn trunk, so what you're seeing
> is probably the document being autodetected as a zip file, and then
> the contained content being parsed as XML.
>
> You may want to check the latest svn trunk (or the upcoming 0.5
> release) to see if your problem has already been solved. If not,
> please file a bug report in the issue tracker.
>
> BR,
>
> Jukka Zitting
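
For anyone reading this thread later, here is a rough sketch of why an Office 2007 file can look like a plain zip to a detector. It is not Tika's code. It assumes only that an Office 2007 document is a zip package holding a "[Content_Types].xml" entry, and that the main part sits at its conventional path ("word/document.xml" for Word, "xl/workbook.xml" for Excel). The function name sniff_office_2007 and the labels it returns are made up for this example.

```python
import io
import zipfile

# Local-file-header signature found at the start of a non-empty zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_2007(data: bytes) -> str:
    """Classify raw bytes as 'docx', 'xlsx', 'ooxml', 'zip', or 'not-zip'.

    Hypothetical helper for illustration; the labels are not MIME types.
    """
    # Step 1: the magic-byte check. On its own this can only say "zip",
    # which matches what the "file" command reported.
    if not data.startswith(ZIP_MAGIC):
        return "not-zip"

    # Step 2: look inside the container at the entry names.
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "not-zip"

    # Assumption: every Office 2007 package carries this entry.
    if "[Content_Types].xml" not in names:
        return "zip"

    # Assumption: the main part is at its conventional location.
    if "word/document.xml" in names:
        return "docx"
    if "xl/workbook.xml" in names:
        return "xlsx"
    return "ooxml"
```

A detector that stops after step 1 sees only a zip archive, and the entries inside are XML, which fits the behaviour described above. Calling sniff_office_2007(open(path, "rb").read()) on a real file is the intended use, where path is whatever document you want to check.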