Tim Allison created TIKA-2986:
---------------------------------

             Summary: Edge case (?) in file type detection
                 Key: TIKA-2986
                 URL: https://issues.apache.org/jira/browse/TIKA-2986
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison


I recently came across a file that was identified as an Acrobat fdf file.  The 
particular file was some kind of binary file with a ".fdf" extension, but not 
an Acrobat fdf.  

Our current MimeTypes algorithm runs magic detection first, and then it tries 
the file extension.  If the file extension suggests a child mime type of what 
was found via magic, the child type is used.  The problem with this file was 
that the magic {{%FDF-}} was not found, so the magic step returned 
{{application/octet-stream}}.  The type for the ".fdf" extension was then 
selected because {{application/vnd.fdf}} is a child of 
{{application/octet-stream}}.
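
To make the sequence above concrete, here is a minimal, self-contained sketch 
of the two-step behavior.  It is a toy model, not Tika's actual MimeTypes 
code: the class {{DetectionSketch}} and all of its method names are 
hypothetical, only the FDF type is modelled, and the {{%FDF-}} magic is 
assumed to sit at offset 0.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Toy model of "magic first, then extension"; all names are hypothetical.
public class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String FDF = "application/vnd.fdf";

    // Step 1: magic. Only the %FDF- header is modelled (assumed at offset 0).
    static String detectByMagic(byte[] prefix) {
        byte[] magic = "%FDF-".getBytes(StandardCharsets.US_ASCII);
        boolean hit = prefix.length >= magic.length
                && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic);
        return hit ? FDF : OCTET_STREAM;
    }

    // Step 2: extension. Only ".fdf" is modelled.
    static String detectByName(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".fdf") ? FDF : OCTET_STREAM;
    }

    // In this sketch every type counts as a child of application/octet-stream.
    static boolean isSpecialization(String candidate, String base) {
        return candidate.equals(base) || base.equals(OCTET_STREAM);
    }

    // The extension wins whenever it names a child of the magic result.
    static String detect(byte[] prefix, String name) {
        String byMagic = detectByMagic(prefix);
        String byName = detectByName(name);
        return isSpecialization(byName, byMagic) ? byName : byMagic;
    }
}
```

In this model, arbitrary binary content named "data.fdf" comes back as 
{{application/vnd.fdf}}, because the magic step can only answer 
{{application/octet-stream}} and every type is a child of that.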

It feels like we might want to add a rule: if a mime definition has a defined 
magic and that magic is not found, we should not then fall back to the file 
extension.  Or is there a better way to prevent this from happening?  Or is 
this just an edge case that we should ignore?
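
One way the proposed rule could look is sketched below.  This is again a 
self-contained toy model rather than a patch against Tika: the class 
{{GuardedDetectionSketch}}, its lookup tables, and the type 
{{application/x-example}} (a made-up type with an extension but no magic) are 
all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the proposed rule; all names are hypothetical.
public class GuardedDetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String FDF = "application/vnd.fdf";
    // Made-up type that has an extension but no magic definition.
    static final String NO_MAGIC = "application/x-example";

    static final Map<String, String> BY_EXTENSION =
            Map.of(".fdf", FDF, ".example", NO_MAGIC);
    // Types whose definition includes magic bytes.
    static final Set<String> HAS_MAGIC = Set.of(FDF);

    // Only the %FDF- header is modelled (assumed at offset 0).
    static String detectByMagic(byte[] prefix) {
        byte[] magic = "%FDF-".getBytes(StandardCharsets.US_ASCII);
        boolean hit = prefix.length >= magic.length
                && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic);
        return hit ? FDF : OCTET_STREAM;
    }

    static String detectByName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        int dot = lower.lastIndexOf('.');
        if (dot < 0) {
            return OCTET_STREAM;
        }
        return BY_EXTENSION.getOrDefault(lower.substring(dot), OCTET_STREAM);
    }

    static String detect(byte[] prefix, String name) {
        String byMagic = detectByMagic(prefix);
        String byName = detectByName(name);
        if (byName.equals(byMagic)) {
            return byMagic;
        }
        // Proposed rule: the extension names a type that defines magic, but
        // that magic did not match, so the extension is not trusted.
        if (HAS_MAGIC.contains(byName)) {
            return byMagic;
        }
        // Otherwise keep the existing fallback to the extension when the
        // magic step found nothing more specific than octet-stream.
        return byMagic.equals(OCTET_STREAM) ? byName : byMagic;
    }
}
```

With this guard, binary content named "data.fdf" stays 
{{application/octet-stream}}, while types that define no magic can still be 
picked up from the extension alone.  Whether real mime definitions separate 
this cleanly (for example, types whose magic is optional or low priority) is 
exactly the open question above.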



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
