Re: Detecting container formats

Max Valjanski Thu, 17 Jun 2010 00:58:48 -0700

Hello!

Nick Burch wrote:

At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the current method works well. AFAICT, we detect the container, then have sub-class matches that try to look for the appropriate children by hoping we can guess where the definition might hide within the container. However, I think this is too unreliable - for example, with a .doc file, the entry for the Word stream can come anywhere in the list of top level entries, so is very hard to reliably find without properly parsing the OLE2 structure.
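To illustrate the part that does work on a stream: the container type itself can be recognised from the first few bytes. A rough sketch in plain Java, not Tika's actual detector (the class and method names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContainerSniffer {
    // OLE2 compound document signature (first 8 bytes of the file).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\003\004".
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Returns "ole2", "zip" or "unknown" and rewinds the stream to its start. */
    public static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] head = new byte[OLE2_MAGIC.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        in.reset();
        if (startsWith(head, n, OLE2_MAGIC)) return "ole2";
        if (startsWith(head, n, ZIP_MAGIC)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int len, byte[] prefix) {
        if (len < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(new ByteArrayInputStream(zipLike)));
    }
}
```

This only says "OLE2" or "ZIP". Telling .doc from .xls, or .docx from .odt, still needs the entries inside the container, which is the hard part described above.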

I tried to do that, but I found that it does not fit into Tika's architecture. Parsing an OLE container requires reading the whole file, but Tika works with streams, so we can

1) remove streaming support and work only with files (or save the stream into a temporary file before processing), or
2) parse the OLE container during mime-type detection and pass it on to the text extractor (parser)
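For the temporary-file variant of option 1, something like the following would be enough. This is only a sketch with the standard Java library; `StreamSpooler` and `spoolToTempFile` are hypothetical names, not existing Tika code, and the caller is assumed to delete the file when done:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StreamSpooler {
    /**
     * Copies the whole stream into a temporary file so that a container
     * parser can seek in it freely. The caller owns the returned file.
     */
    public static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("container-", ".tmp");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);  // do not leave a partial file behind
            throw e;
        }
        return tmp;
    }
}
```

The cost is an extra copy of every document on disk, even for formats that could have been handled directly from the stream.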

I do not like the first solution, but the second requires architecture changes in Tika.


Anyway, I wrote type detection code for OLE in TIKA-437.

Best wishes, Max
