[jira] [Comment Edited] (TIKA-2986) Edge case (?) in file type detection

Jira Mon, 18 Nov 2019 20:15:25 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977050#comment-16977050
 ]


Luís Filipe Nassif edited comment on TIKA-2986 at 11/19/19 4:14 AM:
--------------------------------------------------------------------

Hi [~tallison],

It is not an edge case. I have hit this situation several times while processing 
our forensic corpus, which consists of deleted and recovered data that is often 
corrupted or totally overwritten. Like you, I think it makes sense to use 
extension detection only for mimetypes that do not have registered magics, or to 
refine parent mimetypes previously detected based on magics.

But it is a dramatic change in behaviour and has to be carefully tested...

[~nick], probably the .fdf found by Tim is nothing, just garbage data, and will 
not have any magic, like the deleted and overwritten data my organization often 
needs to work with.


> Edge case (?) in file type detection
> ------------------------------------
>
>                 Key: TIKA-2986
>                 URL: https://issues.apache.org/jira/browse/TIKA-2986
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>
> I recently came across a file that was identified as an Acrobat fdf file.  
> The particular file was some kind of binary file with a ".fdf" extension, but 
> not an Acrobat fdf.  
> Our current MimeTypes algorithm runs magic first, and then it tries to use 
> the file extension.  If the file extension suggests a child mime type of what 
> was found via magic, that is used.  The problem with this file was that the 
> magic {{%FDF-}} was not found, so the magic step returned 
> {{application/octet-stream}}, and then the file extension, which was ".fdf", was 
> selected because {{application/vnd.fdf}} is a child of 
> {{application/octet-stream}}.
> It feels like we might want to add a rule that if a mime definition has a 
> defined magic and that magic is not found, we should not then fall back to 
> the file extension. Or, is there a better way to prevent this from happening? 
> Or, is this just an edge case that we should ignore?
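
The behaviour described in the issue, and the rule it proposes, can be sketched in a few lines of Java. This is only a toy model and not Tika's MimeTypes implementation: the class DetectionSketch, its methods, the two-entry type table, and the type application/x-example-nomagic with its ".nomagic" extension are all hypothetical, and every type is assumed to be a direct child of application/octet-stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Toy model of magic-then-extension detection. Hypothetical code, NOT Tika's API.
class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";
    static final String FDF = "application/vnd.fdf";            // has a registered magic
    static final String NO_MAGIC = "application/x-example-nomagic"; // made-up type, no magic
    static final byte[] FDF_MAGIC = "%FDF-".getBytes(StandardCharsets.US_ASCII);

    // Magic step: only the FDF magic is known in this sketch.
    static String detectByMagic(byte[] data) {
        if (data.length < FDF_MAGIC.length) return OCTET_STREAM;
        for (int i = 0; i < FDF_MAGIC.length; i++) {
            if (data[i] != FDF_MAGIC[i]) return OCTET_STREAM;
        }
        return FDF;
    }

    // Extension step: a two-entry lookup table.
    static String detectByExtension(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".fdf")) return FDF;
        if (lower.endsWith(".nomagic")) return NO_MAGIC;
        return OCTET_STREAM;
    }

    static boolean hasRegisteredMagic(String type) {
        return FDF.equals(type);
    }

    // Behaviour as described in the issue: the extension wins whenever it names
    // a child of the magic result (here: whenever magic found nothing).
    static String detectCurrent(byte[] data, String name) {
        String byMagic = detectByMagic(data);
        return OCTET_STREAM.equals(byMagic) ? detectByExtension(name) : byMagic;
    }

    // Proposed rule: if the extension's type has a registered magic and that
    // magic was not found, do not fall back to the extension.
    static String detectProposed(byte[] data, String name) {
        String byMagic = detectByMagic(data);
        if (!OCTET_STREAM.equals(byMagic)) return byMagic;
        String byName = detectByExtension(name);
        return hasRegisteredMagic(byName) ? OCTET_STREAM : byName;
    }

    public static void main(String[] args) {
        byte[] garbage = {0x00, 0x13, 0x37, 0x42, 0x7f};
        byte[] realFdf = "%FDF-1.2\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectCurrent(garbage, "form.fdf"));       // application/vnd.fdf
        System.out.println(detectProposed(garbage, "form.fdf"));      // application/octet-stream
        System.out.println(detectProposed(realFdf, "form.fdf"));      // application/vnd.fdf
        System.out.println(detectProposed(garbage, "notes.nomagic")); // application/x-example-nomagic
    }
}
```

In this sketch, garbage bytes named "form.fdf" are reported as application/vnd.fdf under the current behaviour and as application/octet-stream under the proposed rule, while a type without a registered magic is still detected from its extension alone.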



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
