[ 
https://issues.apache.org/jira/browse/TIKA-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349286#comment-17349286
 ] 

Jon Sneyers commented on TIKA-3411:
-----------------------------------

I'm not actually a user of Tika, but I am the chair of the JPEG XL ad hoc group 
within JPEG (ISO/IEC JTC 1/SC 29/WG 1).

The 2-byte version is the one we expect to be most common on the web. The 
longer one is the container format, used when XMP/Exif metadata needs to be 
attached; on the web we expect such metadata will usually be stripped, so the 
compact, codestream-only form gets used. Its signature is only 2 bytes 
precisely to avoid header overhead; obviously this comes at the cost of 
potentially more false-positive matches.

In JPEG, going back to JPEG-1, the convention for all codestream markers is to 
use 0xFF followed by a unique byte. The specific combination of 0xFF0A was 
assigned to "start of JPEG XL codestream". It is supposed to be a unique magic.
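
To make the two signatures concrete, here is a minimal Python sketch of 
sniffing them. The byte values are the ones given in the issue description 
quoted below; the function and constant names are purely illustrative and are 
not part of Tika or of any JPEG XL library.

```python
# Signatures as listed in the TIKA-3411 issue description.
JXL_CODESTREAM_MAGIC = bytes([0xFF, 0x0A])
JXL_CONTAINER_MAGIC = bytes.fromhex("0000000C4A584C200D0A870A")


def sniff_jxl(data: bytes):
    """Classify leading bytes as 'container', 'codestream', or None.

    Hypothetical helper: it only inspects the first bytes and does not
    validate the rest of the file.
    """
    # Check the longer, less ambiguous signature first.
    if data.startswith(JXL_CONTAINER_MAGIC):
        return "container"
    if data.startswith(JXL_CODESTREAM_MAGIC):
        return "codestream"
    return None
```

A classic JPEG, which starts with FF D8, is not matched by either check.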

Text files with a BOM could start with 0xFF (the UTF-16 little-endian BOM is 
0xFF 0xFE), but they shouldn't be able to start with 0xFF0A unless they're 
broken.
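
The BOM point can be checked mechanically. The sketch below compares the 
standard Unicode BOMs exposed by Python's codecs module against the 2-byte 
codestream magic; the helper names are illustrative only.

```python
import codecs

JXL_CODESTREAM_MAGIC = b"\xff\x0a"

# Standard byte order marks from the Python standard library.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 LE": codecs.BOM_UTF16_LE,
    "UTF-16 BE": codecs.BOM_UTF16_BE,
    "UTF-32 LE": codecs.BOM_UTF32_LE,
    "UTF-32 BE": codecs.BOM_UTF32_BE,
}


def boms_starting_with_ff():
    """Names of BOMs whose first byte is 0xFF."""
    return [name for name, bom in BOMS.items() if bom[:1] == b"\xff"]


def boms_colliding_with_jxl():
    """Names of BOMs whose first two bytes equal the JPEG XL magic."""
    return [name for name, bom in BOMS.items()
            if bom[:2] == JXL_CODESTREAM_MAGIC]
```

Only the little-endian UTF-16 and UTF-32 BOMs begin with 0xFF, and in both 
cases the second byte is 0xFE, so neither collides with 0xFF0A.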

If you know of any examples of false-positive matches (other than broken 
files), I would greatly appreciate hearing about them, because they would also 
be relevant for many other applications (in particular browsers) that rely on 
magic sniffing.

Does Tika also consider filename extensions? If false positives are a concern, 
it might be a good idea to only consider 0xFF0A a match when the extension is 
.jxl.
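
A rough Python sketch of that extension-gated policy follows. It assumes the 
12-byte container signature is distinctive enough to accept on its own, which 
is my assumption rather than a stated rule; the names are hypothetical and 
this is not how Tika's detector is written.

```python
import os

CODESTREAM_MAGIC = b"\xff\x0a"
CONTAINER_MAGIC = bytes.fromhex("0000000C4A584C200D0A870A")


def is_jxl(data: bytes, filename: str = "") -> bool:
    """Conservative JPEG XL check (illustrative policy only)."""
    # Assumption: the long container signature is accepted regardless
    # of the file name.
    if data.startswith(CONTAINER_MAGIC):
        return True
    # The short 2-byte magic counts only when the extension agrees.
    extension = os.path.splitext(filename)[1].lower()
    return extension == ".jxl" and data.startswith(CODESTREAM_MAGIC)
```

With this policy, a file named notes.txt that happens to start with FF 0A is 
not reported as JPEG XL, while photo.jxl with the same leading bytes is.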

> Add image/jxl
> -------------
>
>                 Key: TIKA-3411
>                 URL: https://issues.apache.org/jira/browse/TIKA-3411
>             Project: Tika
>          Issue Type: Wish
>          Components: mime
>            Reporter: Jon Sneyers
>            Priority: Major
>
> image/jxl is the media type for JPEG XL (ISO/IEC 18181).
> Conventional filename extension is .jxl
> It is quite straightforward to detect based on magic: there are two possible 
> header byte sequences:
> {{FF 0A}} 
> or
>  {{00 00 00 0C 4A 58 4C 20 0D 0A 87 0A}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
