Hello!
-10.01.-28163 22:59, Nick Burch wrote:
At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd
etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx,
.odf, .odt, .ots, .sxw etc), I don't think the current method works
well. AFAICT, we detect the container, then have sub-class matches
that try to look
for the appropriate children by hoping we can guess where the
definition might hide within the container. However, I think this is
too unreliable - for example, with a .doc file, the entry for the Word
stream can come anywhere in the list of top level entries, so is very
hard to reliably find without properly parsing the OLE2 structure.
I tried to do that, but found that it does not fit into Tika's
architecture. Parsing an OLE container requires reading the whole file,
but Tika works with streams, so we can either
1) drop streaming support and work only with files (or save the stream
to a temporary file before processing), or
2) parse the OLE container during MIME type detection and pass the
parsed container on to the text extractor (parser).
I do not like the first solution, but the second requires architectural
changes in Tika.
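To make option 1 concrete, here is a rough sketch of spooling a stream to a temporary file using only the standard Java library. The class and method names (StreamSpooler, spoolToTempFile) are hypothetical and are not part of Tika; this is only an illustration of the idea, not the code from TIKA-437.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not a Tika class.
class StreamSpooler {

    // Copies the whole stream into a temporary file so that a container
    // parser can seek around in it. The caller deletes the file when done.
    static Path spoolToTempFile(InputStream in) throws IOException {
        // createTempFile already creates an empty file, hence REPLACE_EXISTING.
        Path tmp = Files.createTempFile("spooled-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4};
        Path tmp = spoolToTempFile(new ByteArrayInputStream(data));
        try {
            System.out.println("Spooled " + Files.size(tmp) + " bytes to " + tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is obvious: every document gets written to disk once before detection can even start, which is why I do not like this approach.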
Anyway, I wrote type detection code for OLE in TIKA-437.
best wishes, Max