tika-user  

Re: Office 2007?

Mark Kerzner
Thu, 12 Nov 2009 05:01:22 -0800

Thank you, Jukka, will try that.
Mark

On Thu, Nov 12, 2009 at 5:33 AM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:

> Hi,
>
> On Thu, Nov 12, 2009 at 4:42 AM, Mark Kerzner <markkerz...@gmail.com>
> wrote:
> > I tried to extract text from Office 2007 Word and Excel files, and Tika
> > thinks they are XML files. The "file" command in Linux thinks they are
> > "zip" files. Where should I look for the current format list? What are
> > the plans for Office 2007?
>
> Tika has supported Office 2007 formats since the 0.3 release. But
> note that autodetection support (magic byte patterns) for Office 2007
> has only been in svn trunk for a few months, so what you're seeing
> is probably the document being autodetected as a zip file, and then
> the contained content being parsed as XML.
>
> You may want to check the latest svn trunk (or the upcoming 0.5
> release) to see if your problem has already been solved. If not,
> please file a bug report in the issue tracker.
>
> BR,
>
> Jukka Zitting
>
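
To illustrate why "file" reports these documents as zip and why the parts
inside look like plain XML, here is a minimal Python sketch of telling an
Office 2007 file apart from an ordinary zip archive. It is not Tika's
detection code. The function names sniff_office_2007 and make_zip are
hypothetical, and the check assumes the usual Office layout: a
"[Content_Types].xml" entry at the archive root, with the main parts under
"word/" for Word and "xl/" for Excel.

```python
import io
import zipfile


def sniff_office_2007(data: bytes) -> str:
    """Return a coarse label: 'docx', 'xlsx', 'ooxml', 'zip' or 'not-zip'.

    Hypothetical helper for illustration only; not Tika's algorithm.
    """
    # A zip archive with at least one entry starts with the local file
    # header signature "PK\x03\x04". This is all a magic-byte check sees.
    if not data.startswith(b"PK\x03\x04"):
        return "not-zip"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return "not-zip"
    # Assumption: Office 2007 packages carry this entry at the root.
    if "[Content_Types].xml" not in names:
        return "zip"
    # Assumption: Word and Excel keep their main parts in these folders.
    if any(name.startswith("word/") for name in names):
        return "docx"
    if any(name.startswith("xl/") for name in names):
        return "xlsx"
    return "ooxml"


def make_zip(entry_names):
    """Build a small in-memory zip with the given entry names (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entry_names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_zip(["[Content_Types].xml", "word/document.xml"])
    print(sniff_office_2007(sample))
```

A detector that stops at the first four bytes can only say "zip", which
matches the behaviour described above; looking at the entry names is one
way to narrow that down.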