[
https://issues.apache.org/jira/browse/TIKA-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349322#comment-17349322
]
Nick Burch commented on TIKA-3411:
----------------------------------
The current Tika matching logic is:
* If we have only a filename, match on the extension
* If we have only the bytes, match on magic
* If we have bytes + filename, match initially on magic, then use the filename
to specialise (e.g. Zip magic + .xlsx -> Excel)
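
The three rules above can be sketched roughly as follows. This is a minimal
illustration, not Tika's actual code or API: the tables, the function names
and the SPECIALISES mapping are all hypothetical, and falling back to the
extension when the magic is unknown is an assumption of the sketch.

```python
import os

# Hypothetical lookup tables, for illustration only (not Tika's real registry).
MAGIC = [
    (b"PK\x03\x04", "application/zip"),
    (b"%PDF-", "application/pdf"),
]
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
BY_EXTENSION = {
    ".zip": "application/zip",
    ".pdf": "application/pdf",
    ".xlsx": XLSX,
}
# Maps a specialised type to the container type it is allowed to refine.
SPECIALISES = {XLSX: "application/zip"}
DEFAULT = "application/octet-stream"


def match_magic(data):
    """Return the type whose magic prefixes the data, or None."""
    for prefix, mime in MAGIC:
        if data.startswith(prefix):
            return mime
    return None


def match_extension(filename):
    """Return the type registered for the filename's extension, or None."""
    ext = os.path.splitext(filename)[1].lower()
    return BY_EXTENSION.get(ext)


def detect(data=None, filename=None):
    by_magic = match_magic(data) if data else None
    by_name = match_extension(filename) if filename else None
    if by_magic is None:
        # Only a filename (or unknown magic, an assumption here): use the extension.
        return by_name or DEFAULT
    # Bytes + filename: the filename may only specialise what the magic found.
    if by_name is not None and SPECIALISES.get(by_name) == by_magic:
        return by_name
    return by_magic
```

In this sketch a file with PDF magic but an .xlsx name is still reported as
PDF, because the extension is only trusted when it refines the magic result.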
I suspect it's a bit late to change, but we find a handful of non-ASCII bytes
at the start (or at least a fixed offset near the start) tends to make the
safest magic for matching. The shorter the magic, and the more "normal ASCII"
bytes it includes, the greater the chance of a false match.
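
As a rough back-of-the-envelope for why length matters, here is a small
sketch. It assumes the leading bytes of unrelated files are uniformly random,
which real files are not (text files make ASCII-heavy magics collide far more
often), so treat the numbers as an optimistic floor. The function names and
the corpus size are made up for illustration.

```python
def random_match_probability(magic_length):
    """Chance that uniformly random bytes equal a fixed magic of this length."""
    return 256.0 ** -magic_length


def expected_false_matches(magic_length, corpus_size):
    """Expected number of unrelated files that match purely by chance."""
    return corpus_size * random_match_probability(magic_length)


# Hypothetical corpus of one billion unrelated files.
corpus = 1_000_000_000
print("2-byte magic: ", expected_false_matches(2, corpus))
print("12-byte magic:", expected_false_matches(12, corpus))
```

Under that assumption a 2-byte magic picks up roughly 15,000 chance matches
in a billion files, while a 12-byte magic picks up effectively none.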
Tim has already volunteered to trawl the corpora, so we can hopefully confirm
what percentage of the internet will be affected by the short form matching.
Don't forget, 1% of the internet is still a lot of files... :)
> Add image/jxl
> -------------
>
> Key: TIKA-3411
> URL: https://issues.apache.org/jira/browse/TIKA-3411
> Project: Tika
> Issue Type: Wish
> Components: mime
> Reporter: Jon Sneyers
> Priority: Major
>
> image/jxl is the media type for JPEG XL (ISO/IEC 18181).
> Conventional filename extension is .jxl
> It is quite straightforward to detect based on magic: there are two possible
> header bytes:
> {{FF 0A}}
> or
> {{00 00 00 0C 4A 58 4C 20 0D 0A 87 0A}}
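
For reference, the two signatures quoted in the issue could be checked with
something like the sketch below. The constant and function names are
hypothetical, and this is plain prefix matching rather than how Tika
expresses magic in its mime-type definitions.

```python
# The two byte sequences quoted in the issue description.
JXL_SHORT_MAGIC = bytes.fromhex("FF 0A")
JXL_LONG_MAGIC = bytes.fromhex("00 00 00 0C 4A 58 4C 20 0D 0A 87 0A")


def looks_like_jxl(data):
    """Return True if the data starts with either JPEG XL signature."""
    return data.startswith(JXL_SHORT_MAGIC) or data.startswith(JXL_LONG_MAGIC)
```

The short form is the one at issue: any file that happens to start with
FF 0A will match it, whatever follows.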
--
This message was sent by Atlassian Jira
(v8.3.4#803005)