[ 
https://issues.apache.org/jira/browse/TIKA-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336040#comment-16336040
 ] 

Tim Allison commented on TIKA-2551:
-----------------------------------

This goes back to 
[2011|https://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java?r1=1153096&r2=1153097&;]
 YIKES!  

> Tika Server uses HtmlParser for XML no matter what config is given, even if 
> XML is disabled in Config
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2551
>                 URL: https://issues.apache.org/jira/browse/TIKA-2551
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.17
>            Reporter: Nick Burch
>            Priority: Major
>
> For some reason, the Tika Server has this line in TikaResource.java:
> {code}
> parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
> {code}
> The upshot of which is that the Tika Server (only) will always use the 
> HtmlParser for XML files, no matter what is configured in the Tika Config. If 
> you disable XML in the Tika Config, or assign it to a different parser, this 
> will be silently ignored.
> To test, run the Tika Server with the {{TIKA-866-valid.xml}} test file from 
> {{tika-core/src/test/resources/org/apache/tika/config}} which uses the 
> EmptyParser for everything. If you ask the server what parsers it has, it 
> correctly reports none at http://localhost:9998/parsers . If you give it an 
> XML file, you'd expect it to fall through to the fallback parser (or possibly 
> empty parser). Instead, it gets processed as HTML, which is completely 
> unexpected!
> Originally discovered via 
> https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml
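
To make the reported behaviour concrete, here is a minimal, self-contained sketch of why a hard-coded {{parsers.put(...)}} would win over whatever the config says. It is not Tika's actual code: the class and method names are hypothetical, parsers are represented by plain strings rather than real Parser instances, and it assumes the hard-coded entry is applied on top of the parsers loaded from the config.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Tika.
public class HardCodedOverrideSketch {

    // Stand-in for a config that maps every type to EmptyParser
    // (media type -> parser name).
    static Map<String, String> configuredParsers() {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "EmptyParser");
        parsers.put("text/html", "EmptyParser");
        return parsers;
    }

    // Assumption: the hard-coded put() runs after the config is loaded,
    // so it replaces (or adds) the XML entry regardless of the config.
    static Map<String, String> withHardCodedXmlOverride(Map<String, String> configured) {
        Map<String, String> effective = new HashMap<>(configured);
        effective.put("application/xml", "HtmlParser");
        return effective;
    }

    // Looks up the parser for a media type, or "fallback" if none is registered.
    static String parserFor(Map<String, String> parsers, String mediaType) {
        return parsers.getOrDefault(mediaType, "fallback");
    }

    public static void main(String[] args) {
        Map<String, String> configured = configuredParsers();
        Map<String, String> effective = withHardCodedXmlOverride(configured);
        System.out.println("configured: " + parserFor(configured, "application/xml"));
        System.out.println("effective:  " + parserFor(effective, "application/xml"));
    }
}
```

The same thing happens in this sketch if the config omits the XML entry entirely: the override adds it back, so disabling XML in the config has no effect on the effective mapping.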



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
