[ https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376828#comment-17376828 ]
Tim Allison commented on TIKA-3466:
-----------------------------------
At a high level, Tika does a pretty good job on files in the wild, but it is
not designed for file type identification in adversarial settings; see, e.g.,
https://passthesalt.ubicast.tv/videos/2021-generating-weird-files/
> Cannot detect mimetype of xhtml file when script is first node instead of html
> ------------------------------------------------------------------------------
>
> Key: TIKA-3466
> URL: https://issues.apache.org/jira/browse/TIKA-3466
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.27
> Reporter: Packiaraj Sakkanan
> Priority: Major
>
> The MIME type of the XHTML file below is detected as 'application/xml'
> instead of 'application/xhtml+xml':
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" ?>
> <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
> alert(555);
> ]]></script>
> {code}
>
> One possible solution is to add a root-XML entry for 'script' to
> tika-mimetypes.xml, like this:
> {code:xml}
> <mime-type type="application/xhtml+xml">
>   <!-- The magic priority for xhtml+xml needs to be lower than that of -->
>   <!-- files that contain HTML within them, e.g. mime emails -->
>   <magic priority="40">
>     <match value="&lt;html xmlns=" type="string" offset="0:8192"/>
>   </magic>
>   <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
>   <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
>   <glob pattern="*.xhtml"/>
>   <glob pattern="*.xht"/>
> </mime-type>
> {code}
>
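To illustrate why the proposed root-XML entry would change the result, here is a
minimal sketch of root-element matching using only the JDK's StAX parser. It is
not Tika's actual implementation: the class RootXmlSketch and its methods
rules() and detect() are hypothetical names, and the rule table is a simplified
stand-in for the root-XML entries in tika-mimetypes.xml. It assumes detection
falls back to 'application/xml' when no rule matches the root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootXmlSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the root-XML rules of application/xhtml+xml.
    // includeScript toggles the extra entry proposed in the issue.
    static Map<QName, String> rules(boolean includeScript) {
        Map<QName, String> rules = new LinkedHashMap<>();
        rules.put(new QName(XHTML_NS, "html"), "application/xhtml+xml");
        if (includeScript) {
            rules.put(new QName(XHTML_NS, "script"), "application/xhtml+xml");
        }
        return rules;
    }

    // Reads up to the first start element (the document root) and looks up its
    // namespace URI and local name; unmatched roots fall back to application/xml.
    static String detect(String xml, Map<QName, String> rules) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return rules.getOrDefault(reader.getName(), "application/xml");
                    }
                }
                return "application/xml";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The sample file from the issue description.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<script xmlns=\"http://www.w3.org/1999/xhtml\"><![CDATA[\n"
                + "alert(555);\n"
                + "]]></script>\n";
        System.out.println(detect(xml, rules(false))); // application/xml
        System.out.println(detect(xml, rules(true)));  // application/xhtml+xml
    }
}
```

In this sketch the lookup is keyed on the root element only, so every additional
root element that should count as XHTML needs its own entry. That is consistent
with the comment above: a rule like this covers one more file seen in the wild,
but it does not make detection robust against deliberately crafted input.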