tika-user  

Re: Office 2007?

Jukka Zitting
Thu, 12 Nov 2009 03:34:45 -0800

Hi,

On Thu, Nov 12, 2009 at 4:42 AM, Mark Kerzner <markkerz...@gmail.com> wrote:
> I tried to extract text from an Office 2007 Word and Excel, and Tika thinks
> they are XML files. "file" command in Linux thinks they are "zip" files.
> Where should I look for the current format list? What are the plans for
> Office 2007?

Tika has supported the Office 2007 formats since the 0.3 release. But
note that autodetection support (magic byte patterns) for Office 2007
has only been in svn trunk for a few months, so what you're seeing is
probably the document being autodetected as a zip file, and then the
contained content being parsed as XML.
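
To illustrate why the "file" command reports zip: an Office 2007
document is a zip container holding XML parts. The following is only a
rough sketch of container sniffing using the JDK's java.util.zip
classes; it is not Tika's detector. The class name OoxmlSniffer, the
methods sniff and zipOf, and the returned labels are made up for
illustration, and the check assumes the package contains a
"[Content_Types].xml" entry plus a "word/" or "xl/" folder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OoxmlSniffer {

    /** Returns "not-zip", "zip", "ooxml", "docx" or "xlsx" (illustrative labels only). */
    public static String sniff(byte[] data) throws IOException {
        // Zip local file header magic: 'P' 'K' 0x03 0x04.
        // This is all a plain magic-byte check can see.
        if (data.length < 4 || data[0] != 'P' || data[1] != 'K'
                || data[2] != 3 || data[3] != 4) {
            return "not-zip";
        }
        boolean hasContentTypes = false;
        String kind = null;
        // Look inside the container to tell Office 2007 packages from plain zips.
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.startsWith("word/")) {
                    kind = "docx";
                } else if (name.startsWith("xl/")) {
                    kind = "xlsx";
                }
            }
        }
        if (!hasContentTypes) {
            return "zip";
        }
        return kind != null ? kind : "ooxml";
    }

    /** Builds a small in-memory zip with the given entry names (demo helper). */
    public static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(zipOf("[Content_Types].xml", "word/document.xml")));
        System.out.println(sniff(zipOf("[Content_Types].xml", "xl/workbook.xml")));
        System.out.println(sniff(zipOf("readme.txt")));
        System.out.println(sniff("<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A detector that stops at the first four bytes would report all three
zip inputs above as plain zip, which matches the behaviour described
in the question.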

You may want to check the latest svn trunk (or the upcoming 0.5
release) to see if your problem has already been solved. If not,
please file a bug report in the issue tracker.

BR,

Jukka Zitting