Nick Burch at "Wed, 16 Jun 2010 12:01:48 +0100 (BST)" wrote:
NB> On Tue, 15 Jun 2010, Alex Ott wrote:
>> Hmmm, WordDocument stream in .doc could be only under / directory entry,
>> but yes - it could anywhere in list of OLE2 entries...
NB> And the list of ole2 entries can come anywhere in the file - the header
NB> block contains a pointer to the block holding the entries, which is
NB> normally near the start but isn't required to be...
NB> Detecting OLE2 or Zip with magic seems easy enough, but as mentioned it's
NB> whats inside them that I don't think magic + a few regexps on the first
NB> few kbs will cut it :/
Yep, for OLE2 we need to get the whole file and build the list of entries in
it. For Zip we also need the whole file, but reading the list of entries may
be enough, although sometimes we need to read some files from the archive to
get the correct mime type (odf, {doc,ppt,xls}x, ...).
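To make the Zip case concrete, here is a rough sketch in Python (not Tika code, just an illustration). It assumes the whole file is already in memory. The function name sniff_zip_subtype and the choice of marker entries are my own assumptions: ODF packages store their type in a "mimetype" entry, so that one has to be read, while for OOXML and jar the entry names alone are used here.

```python
import io
import zipfile

OOXML_PREFIX = "application/vnd.openxmlformats-officedocument."


def sniff_zip_subtype(data: bytes) -> str:
    """Guess a more specific MIME type for a Zip container (illustrative only)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())

        # ODF: the type is stored in a plain "mimetype" entry, so we must read it.
        if "mimetype" in names:
            return zf.read("mimetype").decode("ascii", "replace").strip()

        # OOXML: the list of entries is enough to tell the three formats apart.
        if "[Content_Types].xml" in names:
            if "word/document.xml" in names:
                return OOXML_PREFIX + "wordprocessingml.document"
            if "xl/workbook.xml" in names:
                return OOXML_PREFIX + "spreadsheetml.sheet"
            if "ppt/presentation.xml" in names:
                return OOXML_PREFIX + "presentationml.presentation"

        # Jar: assumed here to be any Zip that carries a manifest.
        if "META-INF/MANIFEST.MF" in names:
            return "application/java-archive"

    # Nothing more specific was recognised.
    return "application/zip"
```

A real detector would probably want stricter checks (for example, parsing [Content_Types].xml instead of trusting entry names), but the split between "names only" and "must read an entry" is the point here.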
I'm not sure how best to implement this in Tika; I need to look into the
sources. One possibility is to create a hierarchy of container processors,
each of which sets the corresponding container subtype, and this value
is then used in the mime-type description. Something like
    if (string at 0 == "PK\x03\x04" and subtype == 10)
    then mimetype = "application/java-archive"
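A minimal sketch of how such a rule could be evaluated, again in Python rather than against the Tika sources, which I have not looked at yet. All names here (ContainerProcessor, ZipProcessor, RULES, detect) are hypothetical, and I use a string subtype "jar" in place of a numeric value like 10.

```python
import io
import zipfile
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"


class ContainerProcessor:
    """Hypothetical base class: inspects a container and reports its subtype."""

    magic = b""

    def subtype(self, data: bytes) -> Optional[str]:
        raise NotImplementedError


class ZipProcessor(ContainerProcessor):
    magic = ZIP_MAGIC

    def subtype(self, data: bytes) -> Optional[str]:
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            return None  # magic matched, but the archive is unreadable
        # Assumption: a manifest entry is what marks a Zip as a jar.
        if "META-INF/MANIFEST.MF" in names:
            return "jar"
        return None


PROCESSORS = [ZipProcessor()]

# Each rule: (magic at offset 0, required subtype, resulting mime type).
RULES = [
    (ZIP_MAGIC, "jar", "application/java-archive"),
    (ZIP_MAGIC, None, "application/zip"),
]


def detect(data: bytes) -> str:
    # Step 1: let the matching container processor set the subtype.
    subtype = None
    for proc in PROCESSORS:
        if data.startswith(proc.magic):
            subtype = proc.subtype(data)
            break
    # Step 2: evaluate rules that combine the magic with the subtype.
    for magic, wanted, mimetype in RULES:
        if data.startswith(magic) and wanted == subtype:
            return mimetype
    return "application/octet-stream"
```

An OLE2 processor would slot into PROCESSORS the same way, with its own magic and its own subtypes derived from the directory entries.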
--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://alexott.net
http://alexott-ru.blogspot.com/