Nick Burch created TIKA-2551:
--------------------------------

             Summary: Tika Server uses HtmlParser for XML no matter what config 
is given, even if XML is disabled in Config
                 Key: TIKA-2551
                 URL: https://issues.apache.org/jira/browse/TIKA-2551
             Project: Tika
          Issue Type: Bug
          Components: server
    Affects Versions: 1.17
            Reporter: Nick Burch


For some reason, the Tika Server has this line in TikaResource.java:
{code}
parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
{code}

As a result, the Tika Server (and only the server) will always use the 
HtmlParser for XML files, no matter what is configured in the Tika Config. If 
you disable XML in the Tika Config, or assign it to a different parser, that 
setting is silently ignored.
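
To make the effect concrete, here is a minimal sketch of why an unconditional {{put}} wins over whatever the config says. It does not use the real Tika classes: {{OverrideSketch}}, {{configuredParsers}} and {{withServerOverride}} are hypothetical names, and parsers are modelled as plain strings purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class OverrideSketch {
    // Hypothetical stand-in for the media type -> parser mapping derived from
    // the Tika Config. When xmlEnabled is false, XML has no parser at all.
    static Map<String, String> configuredParsers(boolean xmlEnabled) {
        Map<String, String> parsers = new HashMap<>();
        if (xmlEnabled) {
            parsers.put("application/xml", "EmptyParser");
        }
        return parsers;
    }

    // Models the hard-coded put(): it runs regardless of the configured
    // mapping, so it replaces (or re-adds) the XML entry every time.
    static Map<String, String> withServerOverride(Map<String, String> configured) {
        Map<String, String> effective = new HashMap<>(configured);
        effective.put("application/xml", "HtmlParser");
        return effective;
    }

    public static void main(String[] args) {
        // XML assigned to a different parser in the config: prints "HtmlParser"
        System.out.println(withServerOverride(configuredParsers(true)).get("application/xml"));
        // XML disabled in the config: still prints "HtmlParser"
        System.out.println(withServerOverride(configuredParsers(false)).get("application/xml"));
    }
}
```

In both cases the configured value never reaches the caller, which matches the behaviour described above.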

To test, run the Tika Server with the {{TIKA-866-valid.xml}} test file from 
{{tika-core/src/test/resources/org/apache/tika/config}}, which uses the 
EmptyParser for everything. If you ask the server what parsers it has, it 
correctly reports none at http://localhost:9998/parsers . If you give it an XML 
file, you'd expect it to fall through to the fallback parser (or possibly the 
EmptyParser). Instead, it gets processed as HTML, which is completely unexpected!
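
For anyone wanting to reproduce this from code rather than by hand, a rough sketch using the JDK 11+ HTTP client follows. It assumes a Tika Server is already running on localhost:9998 with the config above, and that the document is sent with a PUT to the server's {{/tika}} endpoint; the class name {{XmlProbe}}, the helper {{buildRequest}} and the sample XML body are made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class XmlProbe {
    // Builds a PUT request that sends an XML document to the server's /tika
    // endpoint, explicitly declared as application/xml.
    static HttpRequest buildRequest(String baseUrl, String xml) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Content-Type", "application/xml")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample document, invented for this sketch.
        String xml = "<?xml version=\"1.0\"?><doc><title>hello</title></doc>";
        HttpRequest request = buildRequest("http://localhost:9998", xml);
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // With an EmptyParser-only config, any extracted text here
            // suggests the XML was handled by some other parser.
            System.out.println("Status: " + response.statusCode());
            System.out.println("Body: " + response.body());
        } catch (IOException e) {
            System.out.println("Could not reach the server: " + e.getMessage());
        }
    }
}
```

If the response body contains the text of the document even though the config maps everything to the EmptyParser, that is consistent with the override described in this issue.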

Originally discovered via 
https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
