[
https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377317#comment-17377317
]
Nick Burch commented on TIKA-3466:
----------------------------------
I'm happy to add the xmlns version as a match, that seems pretty distinctive
However, I don't think this will fix your underlying issue. As others have
said, Tika will tell you what it thinks a document is, falling back to more
general types if it isn't sure. Deny-listing a handful of "unwanted" types is
not a good plan, Tika may well miss many cases of these, especially in
adversarial environments. If you are worried, you should instead do allow-list,
checking for only wanted typed. This still needs a manual review option though,
for valid files that Tika fails to correctly detect
> Cannot detect mimetype of xhtml file when script is first node instead of html
> ------------------------------------------------------------------------------
>
> Key: TIKA-3466
> URL: https://issues.apache.org/jira/browse/TIKA-3466
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.27
> Reporter: Packiaraj Sakkanan
> Priority: Major
>
> mime-type of below xhtml file deduced as 'application/xml' instead of
> 'application/xhtml+xml'
> {code:java}
> <?xml version="1.0" encoding="UTF-8" ?>
> <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
> alert(555);
> ]]></script>
> {code}
>
> one possible solution is to add 'script' in tika-mimetypes.xml, like
> {code:java}
> <mime-type type="application/xhtml+xml">
> <!-- The magic priority for xhtml+xml needs to be lower than that of -->
> <!-- files that contain HTML within them, e.g. mime emails -->
> <magic priority="40">
> <match value="<html xmlns=" type="string" offset="0:8192"/>
> </magic>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
> <glob pattern="*.xhtml"/>
> <glob pattern="*.xht"/>
> </mime-type>
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)