[
https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376771#comment-17376771
]
Kenneth William Krugler commented on TIKA-3466:
-----------------------------------------------
Browsers do all kinds of helicopter stunts to try to handle broken (X)HTML, so
there's no way to support "xhtml check criteria should match with browser
criteria".
Also note that if the concern is security (must identify as XHTML any file that
a browser could render as such), then added in the {{<script>}} tag to XHTML
mime-type check could easily be circumvented by inserting some other tag.
I could imagine running Tika in some "super careful" mode, where it assigns
mime-type based on successful parsing by the most restrictive type that seems
possible (so if whatever HTML parser we're using is able to parse the file,
assume it's HTML. But that would be much, much slower. And support for that
would be non-trivial.
> Cannot detect mimetype of xhtml file when script is first node instead of html
> ------------------------------------------------------------------------------
>
> Key: TIKA-3466
> URL: https://issues.apache.org/jira/browse/TIKA-3466
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.27
> Reporter: Packiaraj Sakkanan
> Priority: Major
>
> mime-type of below xhtml file deduced as 'application/xml' instead of
> 'application/xhtml+xml'
> {code:java}
> <?xml version="1.0" encoding="UTF-8" ?>
> <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
> alert(555);
> ]]></script>
> {code}
>
> one possible solution is to add 'script' in tika-mimetypes.xml, like
> {code:java}
> <mime-type type="application/xhtml+xml">
> <!-- The magic priority for xhtml+xml needs to be lower than that of -->
> <!-- files that contain HTML within them, e.g. mime emails -->
> <magic priority="40">
> <match value="<html xmlns=" type="string" offset="0:8192"/>
> </magic>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
> <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
> <glob pattern="*.xhtml"/>
> <glob pattern="*.xht"/>
> </mime-type>
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)