[ 
https://issues.apache.org/jira/browse/TIKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725568#comment-17725568
 ] 

Gregory Lepore commented on TIKA-3999:
--------------------------------------

There is a chance of collisions with other magic numbers, but at the time I 
created most of the above there was no collision with anything in my test 
environment, which covered all formats documented in PRONOM 
(https://www.nationalarchives.gov.uk/PRONOM/Default.aspx) and most of what 
`file` identifies, plus around 500,000 unidentified formats.

 

That being said, I am glad to hear that your regression suite will verify the 
above.

 

Is it possible to use the shorter magic numbers in addition to a specific file 
extension to limit misidentifications? I don't know what the process is for 
format identification in Tika...
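
To make the question concrete, here is a rough sketch of what I mean, and not
a description of how Tika actually works. The rule table, the `identify`
function, and the choice of the 4-byte "M.K." signature at offset 1080 (as
seen in ProTracker-style MOD files) paired with the `.mod` extension are all
assumptions for illustration only.

```python
from pathlib import PurePath

# Hypothetical rule table: (mime type, offset, magic bytes, allowed extensions).
# A short magic number is only trusted when the extension also agrees.
RULES = [
    ("audio/x-mod", 1080, b"M.K.", {".mod"}),
]


def identify(name, data):
    """Return a mime type only if both the magic bytes and the extension match."""
    ext = PurePath(name).suffix.lower()
    for mime, offset, magic, extensions in RULES:
        magic_matches = data[offset:offset + len(magic)] == magic
        if magic_matches and ext in extensions:
            return mime
    return None  # no rule matched; leave the file unidentified
```

With this, a file carrying "M.K." at offset 1080 is reported as audio/x-mod
only when it is also named `*.mod`; the same bytes under another extension
stay unidentified, which is the trade-off I am asking about.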

 

Thanks, I can put together the magic numbers for the remaining tracker modules 
I've documented and add them.

> audio/xm audio/x-mod
> --------------------
>
>                 Key: TIKA-3999
>                 URL: https://issues.apache.org/jira/browse/TIKA-3999
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Tim Allison
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
