[ https://issues.apache.org/jira/browse/TIKA-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336040#comment-16336040 ]
Tim Allison commented on TIKA-2551:
-----------------------------------
This line in TikaResource.java goes back to
[2011|https://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java?r1=1153096&r2=1153097&].
YIKES!
> Tika Server uses HtmlParser for XML no matter what config is given, even if
> XML is disabled in Config
> -----------------------------------------------------------------------------------------------------
>
> Key: TIKA-2551
> URL: https://issues.apache.org/jira/browse/TIKA-2551
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.17
> Reporter: Nick Burch
> Priority: Major
>
> For some reason, the Tika Server has this line in TikaResource.java:
> {code}
> parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
> {code}
> As a result, the Tika Server (and only the server) will always use the
> HtmlParser for XML files, no matter what is configured in the Tika Config. If
> you disable XML in the Tika Config, or assign it to a different parser, this
> will be silently ignored.
> To test, run the Tika Server with the {{TIKA-866-valid.xml}} test file from
> {{tika-core/src/test/resources/org/apache/tika/config}} which uses the
> EmptyParser for everything. If you ask the server what parsers it has, it
> correctly reports none at http://localhost:9998/parsers . If you give it an
> XML file, you'd expect it to fall through to the fallback parser (or possibly
> the empty parser). Instead, it gets processed as HTML, which is completely
> unexpected!
> Originally discovered via
> https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml
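
The effect described in the issue can be modeled with a small, self-contained sketch. This is a simplified illustration, not the actual TikaResource code: the class name {{ParserOverrideSketch}}, the helper {{buildParsers}}, and the use of plain strings in place of MediaType and Parser objects are all assumptions made so the example runs without Tika on the classpath. It shows only that an unconditional {{put}} performed after the configured mappings are loaded replaces whatever the config said for XML.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the problem; NOT the actual TikaResource code.
// Media types and parsers are plain strings (an assumption for illustration).
class ParserOverrideSketch {

    // Hypothetical helper: builds the media-type -> parser table the server would use.
    static Map<String, String> buildParsers(Map<String, String> configured,
                                            boolean hardCodedOverride) {
        // Start from whatever the (modeled) Tika Config assigned.
        Map<String, String> parsers = new LinkedHashMap<>(configured);
        if (hardCodedOverride) {
            // Mirrors the effect of:
            //   parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
            // The put replaces any configured entry for XML, or adds one if absent.
            parsers.put("application/xml", "HtmlParser");
        }
        return parsers;
    }

    public static void main(String[] args) {
        // Model a config that routes XML to the EmptyParser.
        Map<String, String> configured = new LinkedHashMap<>();
        configured.put("application/xml", "EmptyParser");

        // With the hard-coded line, the configured choice is lost.
        System.out.println(buildParsers(configured, true).get("application/xml"));
        // Without it, the configured parser is used.
        System.out.println(buildParsers(configured, false).get("application/xml"));
    }
}
```

In this model the override wins whether XML is mapped to another parser or left out of the config entirely, which matches the "silently ignored" behaviour reported above.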