[ 
https://issues.apache.org/jira/browse/TIKA-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349322#comment-17349322
 ] 

Nick Burch commented on TIKA-3411:
----------------------------------

The current Tika matching logic is:
* If we have only a filename, match on the extension
* If we have only the bytes, match on magic
* If we have bytes + filename, match initially on magic, then use the filename 
to specialise (eg Zip magic + .xlsx -> Excel)
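
As a rough sketch of the matching order above (this is not Tika's actual code; the tables, the function names and the fallback to the extension when no magic matches are all assumptions made for illustration):

```python
import os
from typing import Optional

XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Hypothetical, tiny lookup tables; a real registry holds far more entries.
MAGIC = {b"PK\x03\x04": "application/zip", b"%PDF-": "application/pdf"}
EXTENSIONS = {".zip": "application/zip", ".pdf": "application/pdf", ".xlsx": XLSX}
# Maps a specialised type to the container type whose magic it shares.
SPECIALISES = {XLSX: "application/zip"}


def by_extension(filename: str) -> Optional[str]:
    """Look up a type from the filename extension alone."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


def by_magic(data: bytes) -> Optional[str]:
    """Look up a type from the leading bytes alone."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    return None


def detect(data: Optional[bytes] = None, filename: Optional[str] = None) -> str:
    magic_type = by_magic(data) if data is not None else None
    ext_type = by_extension(filename) if filename is not None else None
    if magic_type is None:
        # Filename only. Also used here when the bytes match no magic,
        # which is an assumption of this sketch.
        return ext_type or "application/octet-stream"
    # Bytes + filename: magic decides first, and the extension may only
    # narrow the result to a specialisation of the magic-matched type.
    if ext_type is not None and SPECIALISES.get(ext_type) == magic_type:
        return ext_type
    return magic_type
```

In this sketch a .xlsx name on top of PDF bytes still comes back as PDF, because the extension is only allowed to specialise the magic result, never to override it.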

I suspect it's a bit late to change, but we find that a handful of non-ASCII 
bytes at the start (or at least at a fixed offset near the start) tends to make 
the safest magic for matching. The shorter the magic, and the more "normal 
ASCII" bytes it includes, the greater the chance of a false match.
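
To put a rough number on the length effect, here is a small sketch using the two JPEG XL signatures quoted from the issue below. The helper names are hypothetical, and the uniform-random-bytes model is a simplifying assumption: real files are not random, so real false match rates can differ considerably.

```python
# The two signatures from the issue description.
JXL_CODESTREAM = bytes.fromhex("FF0A")
JXL_CONTAINER = bytes.fromhex("0000000C4A584C200D0A870A")


def is_jxl(head: bytes) -> bool:
    """True if the leading bytes start with either JPEG XL signature."""
    return head.startswith(JXL_CODESTREAM) or head.startswith(JXL_CONTAINER)


def random_match_probability(magic: bytes) -> float:
    """Chance that uniformly random bytes happen to begin with `magic`.

    Idealised model: every byte value is assumed equally likely.
    """
    return 1.0 / (256 ** len(magic))


for name, magic in [("codestream", JXL_CODESTREAM), ("container", JXL_CONTAINER)]:
    print(f"{name}: {len(magic)} bytes, p = {random_match_probability(magic):.3g}")
```

Under that model the 2-byte form would match about 1 in 65,536 unrelated files, while the 12-byte form is effectively never matched by chance.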

Tim has already volunteered to trawl the corpora, so we can hopefully confirm 
what percentage of the internet will be affected by the short form matching. 
Don't forget, 1% of the internet is still a lot of files.... :)

> Add image/jxl
> -------------
>
>                 Key: TIKA-3411
>                 URL: https://issues.apache.org/jira/browse/TIKA-3411
>             Project: Tika
>          Issue Type: Wish
>          Components: mime
>            Reporter: Jon Sneyers
>            Priority: Major
>
> image/jxl is the media type for JPEG XL (ISO/IEC 18181).
> Conventional filename extension is .jxl
> It is quite straightforward to detect based on magic: there are two possible 
> header bytes:
> {{FF 0A}} 
> or
>  {{00 00 00 0C 4A 58 4C 20 0D 0A 87 0A}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
