Alex Ott
Wed, 16 Dec 2009 12:56:31 -0800
Re Tomas Fernandez Lobbe at "Wed, 16 Dec 2009 11:26:26 -0800 (PST)" wrote:
 TFL> Hi, I'm trying to parse a big set of Microsoft Word and Microsoft Excel
 TFL> files. I'm having a problem with some old Excel files: they are not
 TFL> being parsed (both metadata and content info are empty after parsing
 TFL> them).
 TFL> For example, if I run a test similar to ExcelParserTest with my old
 TFL> Excel file, the parsing doesn't return any data.
 TFL> Debugging the parser code (OfficeParser) a little, I found that there
 TFL> is no entry with the name "Workbook" in this Excel file; there is an
 TFL> entry with the name "Book" instead. In any case, the ExcelExtractor
 TFL> won't work with this file (tried it).
 TFL> Has anyone faced this problem before? Does anybody know the first
 TFL> Excel version that can be parsed with Tika?

I think the first supported version is Office 97. Earlier formats aren't
well documented, although some information about them is available (in
xls2csv from the catdoc package, for example). But these formats are very
different from the MS Office 97-2003 formats.

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/
http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/
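
P.S. To spot such files before handing them to the parser, a check on the entry names in the root of the OLE2 container may be enough. The snippet below is only a sketch: ExcelStreamCheck, Kind and classify are hypothetical names, not part of Tika or POI, and it assumes you have already collected the root entry names with whatever OLE2 reader you use. It relies on the "Workbook" vs. "Book" difference described above, with "Book" taken to mean the older Excel 5.0/95 format.

```java
import java.util.Set;

// Hypothetical helper for illustration; not part of Tika or POI.
final class ExcelStreamCheck {

    enum Kind { EXCEL_97_OR_LATER, EXCEL_5_OR_95, NOT_A_WORKBOOK }

    // rootEntryNames: names of the entries in the root directory of the
    // OLE2 container (assumed to be gathered by the caller).
    static Kind classify(Set<String> rootEntryNames) {
        // Check "Workbook" first, so a file carrying both streams is
        // treated as the newer format.
        if (rootEntryNames.contains("Workbook")) {
            return Kind.EXCEL_97_OR_LATER;
        }
        // Only a "Book" stream: the older format that is not parsed.
        if (rootEntryNames.contains("Book")) {
            return Kind.EXCEL_5_OR_95;
        }
        return Kind.NOT_A_WORKBOOK;
    }
}
```

Files classified as EXCEL_5_OR_95 could then be skipped or sent to another tool (xls2csv, for example) instead of silently producing empty results.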