[ 
https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976851#comment-16976851
 ] 

Nick Burch commented on TIKA-2986:
----------------------------------

Based on [https://cwiki.apache.org/confluence/display/TIKA/ErrorsAndExceptions] 
I think it's probably OK to give our best guess at the type even if it turns 
out to be wrong, and it's up to the parser to spot it and error out.

In terms of "must be at the start", I invite you to check the mime types file 
history for all the things that started off at offset 0, and now have a 0:512 
or worse on them... Lots of formats and parsers cope with files with random 
stuff padded on the front!
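
To illustrate what an offset range such as 0:512 means in practice, here is a minimal Python sketch, not Tika's actual implementation. The function name `magic_matches` and its parameters are hypothetical; the assumption is that a range `start:end` lets the magic begin at any offset from `start` to `end` inclusive.

```python
def magic_matches(data: bytes, magic: bytes, start: int = 0, end: int = 0) -> bool:
    """Return True if `magic` begins at some offset in [start, end] of `data`.

    Illustrative only: with the defaults (0:0) the magic must sit at the very
    start of the file; a wider range tolerates padding in front of it.
    """
    # bytes.find() requires the whole match to fit before its end argument,
    # so extend the bound by len(magic) to allow a match starting at `end`.
    return data.find(magic, start, end + len(magic)) != -1


padded = b"junk" + b"%FDF-1.2"
print(magic_matches(padded, b"%FDF-"))          # strict offset 0
print(magic_matches(padded, b"%FDF-", 0, 512))  # relaxed range 0:512
```

The strict call rejects the padded file while the relaxed one accepts it, which is the trade-off described above: each widening makes detection more tolerant but also more likely to match unrelated content.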

The "best" fix is for you to figure out what this {{.fdf}} file really is, and 
add a mimetype with magic for that ;)

> Edge case (?) in file type detection
> ------------------------------------
>
>                 Key: TIKA-2986
>                 URL: https://issues.apache.org/jira/browse/TIKA-2986
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>
> I recently came across a file that was identified as an Acrobat fdf file.  
> The particular file was some kind of binary file with a ".fdf" extension, but 
> not an Acrobat fdf.  
> Our current MimeTypes algorithm runs magic first, and then it tries to use 
> the file extension.  If the file extension suggests a child mime type of what 
> was found via magic, that is used.  The problem with this file was that the 
> magic {{%FDF-}} was not found, so from the magic step, it was 
> {{application/octet-stream}}, and then the file extension, which was ".fdf", was 
> selected because {{application/vnd.fdf}} is a child of {{application/octet-stream}}.
> It feels like we might want to add a rule that if a mime definition has a 
> defined magic and that magic is not found, we should not then fall back to 
> the file extension. Or, is there a better way to prevent this from happening? 
> Or, is this just an edge case that we should ignore?
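
The detection order described in the issue, and the rule it proposes, can be sketched roughly as follows. This is a simplified Python model, not Tika's code: the tables `MAGIC`, `EXTENSIONS` and `PARENT`, the function names, and the `require_magic` flag are all hypothetical, magic is only checked at offset 0, and `application/octet-stream` is used as the generic fallback type.

```python
import os

OCTET_STREAM = "application/octet-stream"

# Hypothetical, tiny registry standing in for the real mime type definitions.
MAGIC = {"application/vnd.fdf": b"%FDF-"}        # type -> magic at offset 0
EXTENSIONS = {".fdf": "application/vnd.fdf"}     # extension -> type
PARENT = {"application/vnd.fdf": OCTET_STREAM}   # type -> parent type


def is_descendant(child, ancestor):
    """Walk up the parent chain; a type counts as its own descendant."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def detect(data, filename, require_magic=False):
    # Step 1: magic. Fall back to the generic type if nothing matches.
    by_magic = next(
        (t for t, m in MAGIC.items() if data.startswith(m)), OCTET_STREAM
    )
    # Step 2: the extension may only refine the magic result to a child type.
    by_ext = EXTENSIONS.get(os.path.splitext(filename)[1].lower())
    if by_ext is None or not is_descendant(by_ext, by_magic):
        return by_magic
    # Proposed rule: if the extension's type defines magic and that magic
    # was not found, do not fall back to the extension.
    if require_magic and by_ext in MAGIC and by_ext != by_magic:
        return by_magic
    return by_ext


binary = b"\x00\x01 not an Acrobat fdf"
print(detect(binary, "form.fdf"))                      # current behaviour
print(detect(binary, "form.fdf", require_magic=True))  # with the proposed rule
```

Under these assumptions the first call reports `application/vnd.fdf` for the binary file, reproducing the edge case, while the second stays at `application/octet-stream`. A genuine FDF file starting with `%FDF-` is detected the same way in both modes.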



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
