tika-user  

Re: parsing old Excel files

Alex Ott
Wed, 16 Dec 2009 12:56:31 -0800

Re

Tomas Fernandez Lobbe  at "Wed, 16 Dec 2009 11:26:26 -0800 (PST)" wrote:
 TFL> Hi, I'm trying to parse a big set of Microsoft Word and Microsoft Excel
 TFL> files. I'm having a problem with some old Excel files: they are not being
 TFL> parsed (both metadata and content are empty after parsing them).

 TFL> For example, if I run a test similar to ExcelParserTest with my old Excel
 TFL> file, the parsing doesn't return any data. Debugging the parser code
 TFL> (OfficeParser) a little, I found that there is no entry named "Workbook"
 TFL> in this Excel file; there is an entry named "Book" instead. Even so, the
 TFL> ExcelExtractor won't work with this file (I tried it).

 TFL> Has anyone faced this problem before? Does anybody know the first Excel
 TFL> version that can be parsed with Tika?

I think the first supported version is Office 97. Earlier formats aren't
well documented, although some information about them is available (in
xls2csv from the catdoc package, for example). But these formats are very
different from the MS Office 97-2003 formats.
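
The "Workbook" versus "Book" entry name mentioned above can be used to tell
such files apart before handing them to the parser. The following is only a
rough sketch: the class ExcelStreamCheck and its Kind enum are hypothetical
names, not part of Tika or POI, and the mapping of "Book" to Excel 5.0/95 is
my understanding of the older format rather than something Tika reports.
The entry names are assumed to have been read already from the root
directory of the OLE2 container (for example with POI's POIFS classes).

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class ExcelStreamCheck {

    /** Rough classification based only on root-level entry names. */
    enum Kind { EXCEL_97_OR_LATER, EXCEL_5_OR_95, NO_WORKBOOK_STREAM }

    /**
     * Classifies an OLE2 container by the names of its root entries.
     * Assumption: "Workbook" is the stream used by Excel 97-2003,
     * while "Book" is the stream used by Excel 5.0/95.
     */
    static Kind classify(Collection<String> rootEntryNames) {
        boolean hasBook = false;
        for (String name : rootEntryNames) {
            if ("Workbook".equalsIgnoreCase(name)) {
                // Prefer the newer stream if both happen to be present.
                return Kind.EXCEL_97_OR_LATER;
            }
            if ("Book".equalsIgnoreCase(name)) {
                hasBook = true;
            }
        }
        return hasBook ? Kind.EXCEL_5_OR_95 : Kind.NO_WORKBOOK_STREAM;
    }

    public static void main(String[] args) {
        // Entry names as they might appear in an old Excel file.
        System.out.println(classify(Arrays.asList("Book", "\u0005SummaryInformation")));
    }
}
```

A caller could use the result to skip or log files classified as
EXCEL_5_OR_95 instead of silently getting empty metadata and content.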


-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/