[ 
https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977499#comment-16977499
 ] 

Tim Allison edited comment on TIKA-2986 at 11/19/19 2:36 PM:
-------------------------------------------------------------

>Nick Burch, probably the .fdf found by Tim is nothing

One of my colleagues found one alternative: 
http://support.digilite.eu/fixtures/PULSEMX/Deliya/Ferrari.fdf

Opened TIKA-2988...

>But it is a dramatic change in behaviour and has to be carefully tested...

Our regression corpus won't be of much use here because I've intentionally 
dropped file extensions.  

-Let me at least do a count of mime types w/ magic-
Iterating over the mime types that define a glob or other name pattern, I find 
224 that also define a magic and 638 that define no magic.
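
For anyone who wants to reproduce this kind of tally, here is a rough, self-contained sketch of the counting logic. The {{Def}} class and the toy entries are hypothetical stand-ins for whatever the real mime registry exposes; they are not Tika classes, and the toy data is not meant to reproduce the 224/638 split.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the tally: among definitions that have a glob or
// other name pattern, count those with and without a magic.
final class MagicCountSketch {

    // Stand-in for a mime definition; NOT a Tika class.
    static final class Def {
        final String name;
        final boolean hasNamePattern;
        final boolean hasMagic;

        Def(String name, boolean hasNamePattern, boolean hasMagic) {
            this.name = name;
            this.hasNamePattern = hasNamePattern;
            this.hasMagic = hasMagic;
        }
    }

    // Returns {withMagic, withoutMagic}, ignoring definitions without a name pattern.
    static int[] countByMagic(List<Def> defs) {
        int withMagic = 0;
        int withoutMagic = 0;
        for (Def def : defs) {
            if (!def.hasNamePattern) {
                continue; // only definitions with a glob/name pattern are of interest
            }
            if (def.hasMagic) {
                withMagic++;
            } else {
                withoutMagic++;
            }
        }
        return new int[] {withMagic, withoutMagic};
    }

    public static void main(String[] args) {
        // Toy data for illustration only.
        List<Def> defs = Arrays.asList(
                new Def("application/vnd.fdf", true, true),
                new Def("text/csv", true, false),
                new Def("application/x-example-magic-only", false, true));
        int[] counts = countByMagic(defs);
        System.out.println("with magic: " + counts[0] + ", without magic: " + counts[1]);
    }
}
```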


> Edge case (?) in file type detection
> ------------------------------------
>
>                 Key: TIKA-2986
>                 URL: https://issues.apache.org/jira/browse/TIKA-2986
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>
> I recently came across a file that was identified as an Acrobat fdf file.  
> The particular file was some kind of binary file with a ".fdf" extension, but 
> not an Acrobat fdf.  
> Our current MimeTypes algorithm runs magic first, and then it tries to use 
> the file extension.  If the file extension suggests a child mime type of what 
> was found via magic, that is used.  The problem with this file was that the 
> magic {{%FDF-}} was not found, so from the magic step, it was 
> {{application/octet-stream}}, and then the file extension, which was ".fdf", 
> was selected because {{application/vnd.fdf}} is a child of 
> {{application/octet-stream}}.
> It feels like we might want to add a rule that if a mime definition has a 
> defined magic and that magic is not found, we should not then fall back to 
> the file extension. Or, is there a better way to prevent this from happening? 
> Or, is this just an edge case that we should ignore?
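
To make the description above concrete, here is a minimal sketch of the detection order and of the proposed rule. It is a simplified model, not Tika's MimeTypes code: the {{MimeDef}} class, the three-entry registry, the flat hierarchy (every type is a direct child of the octet-stream root), and the {{requireMagicWhenDefined}} flag are all assumptions made for illustration. Magic is modelled as a fixed byte prefix at offset 0.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified, hypothetical model of the detection order discussed in this issue.
// This is NOT Tika's MimeTypes implementation.
final class DetectionSketch {

    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical mime definition: an optional magic prefix plus one extension.
    static final class MimeDef {
        final String name;
        final byte[] magic;      // null when the definition has no magic
        final String extension;  // lower case, including the dot

        MimeDef(String name, String magic, String extension) {
            this.name = name;
            this.magic = magic == null ? null : magic.getBytes(StandardCharsets.US_ASCII);
            this.extension = extension;
        }

        boolean hasMagic() {
            return magic != null;
        }

        // Assumption: magic is a fixed byte prefix at offset 0.
        boolean magicMatches(byte[] header) {
            if (magic == null || header.length < magic.length) {
                return false;
            }
            for (int i = 0; i < magic.length; i++) {
                if (header[i] != magic[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    // Illustrative registry; every entry is treated as a direct child of the root type.
    static final List<MimeDef> REGISTRY = Arrays.asList(
            new MimeDef("application/vnd.fdf", "%FDF-", ".fdf"),
            new MimeDef("application/pdf", "%PDF-", ".pdf"),
            new MimeDef("text/csv", null, ".csv"));

    static String detect(byte[] header, String fileName, boolean requireMagicWhenDefined) {
        // Step 1: magic. In this flat model a magic hit is final, because no
        // registered type is a child of another registered type.
        for (MimeDef def : REGISTRY) {
            if (def.magicMatches(header)) {
                return def.name;
            }
        }
        // Step 2: no magic matched, so the magic result is the root type.
        // Any type suggested by the extension is a child of the root.
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (MimeDef def : REGISTRY) {
            if (lower.endsWith(def.extension)) {
                // Proposed rule: the definition has a magic, the magic was not
                // found, so do not fall back to the extension.
                if (requireMagicWhenDefined && def.hasMagic()) {
                    return OCTET_STREAM;
                }
                return def.name;
            }
        }
        return OCTET_STREAM;
    }

    public static void main(String[] args) {
        byte[] binary = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05};
        System.out.println(detect(binary, "Ferrari.fdf", false)); // current behaviour
        System.out.println(detect(binary, "Ferrari.fdf", true));  // with the proposed rule
    }
}
```

With the flag off, the binary ".fdf" file is reported as {{application/vnd.fdf}}, which is the behaviour described in this issue; with the flag on, it stays at the root type. Types without a magic (the {{text/csv}} entry in this sketch) are still resolved from the extension in both modes.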



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
