Nick Burch created TIKA-2551:
--------------------------------
Summary: Tika Server uses HtmlParser for XML no matter what config
is given, even if XML is disabled in Config
Key: TIKA-2551
URL: https://issues.apache.org/jira/browse/TIKA-2551
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.17
Reporter: Nick Burch
For some reason, the Tika Server has this line in TikaResource.java:
{code}
parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
{code}
The upshot is that the Tika Server (and only the server) will always use the
HtmlParser for XML files, no matter what is configured in the Tika Config. If
you disable XML in the Tika Config, or assign it to a different parser, this
will be silently ignored.
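To illustrate why the configuration loses, here is a minimal sketch of the override. It assumes the hard-coded line runs after the parser map has been built from the Tika Config; the surrounding code in TikaResource.java is not shown in this report, so that ordering is an assumption. The class and method names are hypothetical, and plain strings stand in for the real MediaType and Parser types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParserOverrideSketch {

    // Hypothetical stand-in for the parser map built from the Tika Config:
    // every type is assigned the EmptyParser, as in the test config.
    static Map<String, String> configuredParsers() {
        Map<String, String> parsers = new LinkedHashMap<>();
        parsers.put("application/xml", "EmptyParser");
        parsers.put("application/pdf", "EmptyParser");
        return parsers;
    }

    // Mimics the hard-coded put(): it is applied on top of the configured
    // map, so the configured entry for XML is replaced unconditionally.
    static Map<String, String> withServerOverride(Map<String, String> configured) {
        Map<String, String> parsers = new LinkedHashMap<>(configured);
        parsers.put("application/xml", "HtmlParser");
        return parsers;
    }

    public static void main(String[] args) {
        Map<String, String> configured = configuredParsers();
        Map<String, String> effective = withServerOverride(configured);
        System.out.println("configured XML parser: " + configured.get("application/xml"));
        System.out.println("effective XML parser:  " + effective.get("application/xml"));
    }
}
```

Because Map.put replaces any existing value for the key, nothing in the config can survive for that media type unless the put is removed or made conditional.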
To test, run the Tika Server with the {{TIKA-866-valid.xml}} test file from
{{tika-core/src/test/resources/org/apache/tika/config}} as its config; that
file uses the EmptyParser for everything. If you ask the server what parsers
it has (http://localhost:9998/parsers), it correctly reports none. If you give
it an XML file, you'd expect it to fall through to the fallback parser (or
possibly the empty parser). Instead, it gets processed as HTML, which is
completely unexpected!
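The check above can be sketched as two HTTP requests against a locally running server. This is only a sketch: the base URL and the /parsers path come from the report, while the /tika path, the headers, the sample XML, and the class and method names are assumptions for illustration. It needs Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class XmlReproSketch {

    // Base URL from the report; assumes the server is running locally.
    static final String SERVER = "http://localhost:9998";

    // Ask the server which parsers it believes are configured.
    static HttpRequest listParsers() {
        return HttpRequest.newBuilder(URI.create(SERVER + "/parsers"))
                .header("Accept", "text/plain")
                .GET()
                .build();
    }

    // Send an XML document for parsing (the /tika path is an assumption).
    static HttpRequest parseXml(String xml) {
        return HttpRequest.newBuilder(URI.create(SERVER + "/tika"))
                .header("Content-Type", "application/xml")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        String xml = "<?xml version=\"1.0\"?><doc><title>hello</title></doc>";
        for (HttpRequest request : new HttpRequest[] {listParsers(), parseXml(xml)}) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(request.method() + " " + request.uri()
                        + " -> " + response.statusCode());
                System.out.println(response.body());
            } catch (IOException e) {
                // Typically means no server is listening on the assumed port.
                System.err.println("Request to " + request.uri() + " failed: " + e);
            }
        }
    }
}
```

With the all-EmptyParser config loaded, any text extracted from the second request would suggest that the XML was not handled by the configured parser.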
Originally discovered via
https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml