[ 
https://issues.apache.org/jira/browse/TIKA-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377542#comment-17377542
 ] 

Tim Allison edited comment on TIKA-3466 at 7/8/21, 6:04 PM:
------------------------------------------------------------

We need to do as much as we can on Tika to get file detection correct.  

That said, I worry about letting a browser "execute" untrusted/user-supplied 
files without much greater controls in place.

The other concern is polyglots, which are a real risk in this kind of use 
case. Tika only picks "the best" file type; we don't currently identify files 
that can be both a PDF and a zip file, for example.  This tool is still getting 
off the ground, but maybe something like it would be a better fit: 
https://github.com/trailofbits/polyfile ?

To confirm, you want to allow (and execute) XML in the browser but not XHTML or 
html?  Are there other file types that you want to exclude (e.g. pdf, jpeg)?
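
To make the polyglot concern concrete, here is a rough sketch (not how Tika or 
polyfile work) of checking whether one byte stream carries both a PDF header and 
a zip end-of-central-directory record. The class and method names are 
hypothetical, and the 1024-byte header window is an assumption about lenient PDF 
readers, not something taken from this issue.

```java
import java.nio.charset.StandardCharsets;

public class PolyglotSketch {
    // "%PDF-" marks the start of a PDF header.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // "PK\005\006" marks the zip end-of-central-directory (EOCD) record.
    private static final byte[] ZIP_EOCD = {0x50, 0x4B, 0x05, 0x06};

    /** Returns the first index in [from, to) where pattern starts, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length + 1);
        for (int i = Math.max(from, 0); i < last; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    /** Assumption: lenient readers accept the header anywhere in the first 1024 bytes. */
    static boolean looksLikePdf(byte[] data) {
        return indexOf(data, PDF_MAGIC, 0, 1024) >= 0;
    }

    /** Zip is located from the end: 22-byte EOCD plus a comment of up to 65535 bytes. */
    static boolean looksLikeZip(byte[] data) {
        int from = data.length - (22 + 65535);
        return indexOf(data, ZIP_EOCD, from, data.length) >= 0;
    }

    static boolean looksLikePdfZipPolyglot(byte[] data) {
        return looksLikePdf(data) && looksLikeZip(data);
    }
}
```

A detector that returns a single "best" type would report only one of the two 
for such a file, which is the gap described above.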


was (Author: [email protected]):
We need to do as much as we can on Tika to get file detection correct.  

That said, I worry about letting a browser "execute" untrusted/user-supplied 
files without much great controls in place.

To confirm, you want to allow (and execute) XML in the browser but not XHTML or 
html?  Are there other file types that you want to exclude (e.g. pdf, jpeg)?

> Cannot detect mimetype of xhtml file when script is first node instead of html
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3466
>                 URL: https://issues.apache.org/jira/browse/TIKA-3466
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.27
>            Reporter: Packiaraj Sakkanan
>            Priority: Major
>
> The MIME type of the XHTML file below is detected as 'application/xml' 
> instead of 'application/xhtml+xml': 
> {code:java}
> <?xml version="1.0" encoding="UTF-8" ?>
> <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
>   alert(555);
>   ]]></script>
> {code}
>  
>  One possible solution is to add a root-XML entry for 'script' in tika-mimetypes.xml, like 
> {code:java}
> <mime-type type="application/xhtml+xml">
>   <!-- The magic priority for xhtml+xml needs to be lower than that of -->
>   <!--  files that contain HTML within them, e.g. mime emails -->
>   <magic priority="40">
>     <match value="&lt;html xmlns=" type="string" offset="0:8192"/>
>   </magic>
>   <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
>   <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
>   <glob pattern="*.xhtml"/>
>   <glob pattern="*.xht"/>
> </mime-type>
> {code}
>  
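
The effect of the proposed root-XML entry can be illustrated with a small 
JDK-only sketch. This is not Tika's detector code; it only mimics the idea of 
matching the root element's namespace URI and local name, and the class and 
method names are hypothetical. With only 'html' registered, the sample file 
falls through to 'application/xml'; adding 'script' changes the result.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RootXmlSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns the qualified name of the first (root) element in the document. */
    static QName rootElement(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The input is untrusted, so DTDs and external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        throw new IllegalArgumentException("no root element found");
    }

    /** Hypothetical stand-in for root-XML matching: namespace plus allowed local names. */
    static String guessType(String xml, Set<String> xhtmlRootNames) {
        QName root = rootElement(xml);
        boolean xhtmlRoot = XHTML_NS.equals(root.getNamespaceURI())
                && xhtmlRootNames.contains(root.getLocalPart());
        return xhtmlRoot ? "application/xhtml+xml" : "application/xml";
    }

    public static void main(String[] args) {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<script xmlns=\"http://www.w3.org/1999/xhtml\"><![CDATA[\n"
                + "  alert(555);\n"
                + "  ]]></script>\n";
        // Only 'html' registered as a root name: prints application/xml
        System.out.println(guessType(sample, Set.of("html")));
        // 'script' registered as well: prints application/xhtml+xml
        System.out.println(guessType(sample, Set.of("html", "script")));
    }
}
```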



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
