Hello!

-10.01.-28163 22:59, Nick Burch wrote:
At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the current method works well. AFAICT, we detect the container, then have sub-class matches that try to look for the appropriate children by hoping we can guess where the definition might hide within the container. However, I think this is too unreliable - for example, with a .doc file, the entry for the Word stream can come anywhere in the list of top level entries, so it is very hard to find reliably without properly parsing the OLE2 structure.

I tried to do that, but I found that it does not fit into the Tika architecture. Parsing an OLE container requires reading the whole file, whereas Tika works with streams, so we can either

1) remove streaming support and work only with files (or save the stream into a temporary file before processing), or
2) parse the OLE container during MIME type detection and pass it on to the text extractor (parser).

I do not like the first solution, but the second requires architecture changes in Tika.
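
To make the temporary-file variant of the first option concrete, here is a minimal sketch that uses only the Java standard library. The class and method names (StreamSpooler, spoolToTempFile, looksLikeOle2) are hypothetical and are not part of Tika. It copies the stream to a temporary file so that a container parser could later read it with random access, and then checks the 8-byte OLE2 header signature. It does not parse the OLE2 directory itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper, not a Tika class: spools a stream to disk so that
// a whole-file (random access) container parser can be used afterwards.
public class StreamSpooler {

    // Signature found in the first 8 bytes of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Copies the whole stream into a new temporary file and returns its path.
    // The caller is responsible for deleting the file when done.
    public static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    // Returns true if the file starts with the OLE2 signature.
    public static boolean looksLikeOle2(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop.
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        return read == header.length && Arrays.equals(header, OLE2_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real document stream: just the signature plus padding.
        byte[] fake = Arrays.copyOf(OLE2_MAGIC, 16);
        Path tmp = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println("OLE2 container: " + looksLikeOle2(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is an extra copy of every document on disk, which is the drawback of the first option.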

Anyway, I wrote type detection code for OLE in TIKA-437.

best wishes, Max
