[ 
https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906524#action_12906524
 ] 

Nick Burch commented on TIKA-486:
---------------------------------

Thinking about it some more, these non-Microsoft files which use OLE2 are going 
to be just as tricky to spot reliably with magic number detection alone. As 
with the Microsoft formats, you can't predict where in the OLE2 file the key 
blocks will fall, so the magic numbers could be anywhere and are very hard to 
spot.

I think the real solution is to update the OLE2 container-aware detector to 
know about the entries in these files, so it can handle them correctly. I'm 
going to go ahead and do this shortly.
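
As a rough illustration of the entry-based approach described above, the 
sketch below maps the names of the top-level entries in an OLE2 container to a 
MIME type. It is only a sketch under assumptions: the class and method names 
are hypothetical, the entry names ("PerfectOffice_MAIN", "SlideShow", 
"PerfectOffice_OBJECTS", "CONTENTS", "SPELLING") and the specific MIME types 
are illustrative and not confirmed in this thread, and reading the entry names 
out of the file (for example via POI) is left out entirely.

```java
import java.util.Set;

/** Sketch only: picks a MIME type from the top-level entry names of an OLE2 file. */
final class Ole2EntrySketch {
    /** Generic type mentioned in the issue; returned when nothing more specific matches. */
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";

    /**
     * @param rootEntryNames names of the entries in the OLE2 root directory
     *                       (assumed to have been read elsewhere)
     */
    static String detect(Set<String> rootEntryNames) {
        // Assumed marker entries for the Corel formats (illustrative names).
        if (rootEntryNames.contains("PerfectOffice_MAIN")) {
            if (rootEntryNames.contains("SlideShow")) {
                return "application/x-corelpresentations";
            }
            if (rootEntryNames.contains("PerfectOffice_OBJECTS")) {
                return "application/x-quattro-pro";
            }
        }
        // Assumed marker entries for Microsoft Works Word Processor (illustrative names).
        if (rootEntryNames.contains("CONTENTS") && rootEntryNames.contains("SPELLING")) {
            return "application/vnd.ms-works";
        }
        // No known entries found: report the generic OLE2 type.
        return GENERIC_OLE2;
    }
}
```

Because the decision depends on entry names rather than byte offsets, it does 
not matter where in the file the corresponding blocks happen to fall.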

> ContainerAwareDetector doesn't support non-MSOffice files which use the same 
> magic
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-486
>                 URL: https://issues.apache.org/jira/browse/TIKA-486
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Antoni Mylka
>         Attachments: test-documents.zip, 
> tika-non-office-files-with-office-magic.patch
>
>
> There are many applications which use the MSOffice magic number. I know of 
> Corel Presentations X3, Corel Quattro Pro 7 and X3 and Microsoft Works Word 
> Processor. They have their own mime types. 
> They aren't properly supported by POI though, which means that if the 
> ContainerAwareDetector finds such a file, it will resort to the 
> POIFSContainerDetector and return the basic application/x-tika-msoffice file 
> type because POI won't be able to say anything more specific. This will 
> happen even in situations when the fallback detector might come up with a 
> better answer.
> That's why IMHO the fallback detector should be used if the 
> POIFSContainerDetector returns x-tika-msoffice. If the fallback detector 
> comes up with a more specific type - the more specific one should be used.
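
For comparison, the fallback rule proposed in the issue description can be 
sketched as follows. This is a simplified illustration, not Tika code: the 
class and method names are hypothetical, types are plain strings, and "more 
specific" is approximated as "anything other than the generic OLE2 type or 
application/octet-stream", which is an assumption.

```java
/** Sketch only: combines a container detector result with a fallback detector result. */
final class FallbackSketch {
    /** Generic type returned when the container detector cannot be more specific. */
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";
    /** Assumed "don't know" answer from the fallback detector. */
    static final String UNKNOWN = "application/octet-stream";

    static String combine(String containerType, String fallbackType) {
        // A specific answer from the container detector always wins.
        if (!GENERIC_OLE2.equals(containerType)) {
            return containerType;
        }
        // The fallback has nothing better to offer: keep the generic type.
        if (fallbackType == null
                || UNKNOWN.equals(fallbackType)
                || GENERIC_OLE2.equals(fallbackType)) {
            return containerType;
        }
        // Otherwise prefer the fallback's more specific type.
        return fallbackType;
    }
}
```

Note that this rule still relies on the fallback detector finding a usable 
magic number, which is the part that is hard to do reliably for OLE2 files.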

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
