[ 
https://issues.apache.org/jira/browse/TIKA-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341674#comment-16341674
 ] 

Hudson commented on TIKA-2551:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1426 (See 
[https://builds.apache.org/job/Tika-trunk/1426/])
TIKA-2551: No longer hardcode HtmlParser for XML files in tika-server. 
(tallison: 
[https://github.com/apache/tika/commit/066e60d5d6de8d51124c297410e7a4eca787d143])
* (edit) CHANGES.txt
* (edit) tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java


> Tika Server uses HtmlParser for XML no matter what config is given, even if 
> XML is disabled in Config
> -----------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2551
>                 URL: https://issues.apache.org/jira/browse/TIKA-2551
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.17
>            Reporter: Nick Burch
>            Priority: Major
>             Fix For: 2.0
>
>
> For some reason, the Tika Server has this line in TikaResource.java:
> {code}
> parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
> {code}
> The upshot of which is that the Tika Server (only) will always use the 
> HtmlParser for XML files, no matter what is configured in the Tika Config. If 
> you disable XML in the Tika Config, or assign it to a different parser, this 
> will be silently ignored.
> To test, run the Tika Server with the {{TIKA-866-valid.xml}} test file from 
> {{tika-core/src/test/resources/org/apache/tika/config}} which uses the 
> EmptyParser for everything. If you ask the server what parsers it has, it 
> correctly reports none at http://localhost:9998/parsers . If you give it an 
> XML file, you'd expect it to fall through to the fallback parser (or possibly 
> empty parser). Instead, it gets processed as HTML, which is completely 
> unexpected!
> Originally discovered via 
> https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml
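
The shadowing described above can be illustrated with a minimal sketch. It assumes the server holds a media-type-to-parser map seeded from the Tika Config, as the quoted parsers.put(...) line suggests. The class and method names are hypothetical, and plain strings stand in for Tika's MediaType and Parser instances; this is a toy model, not the actual TikaResource code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: strings stand in for Tika's MediaType and Parser objects.
// All names here are hypothetical and are not part of Tika's API.
final class HardcodedOverrideSketch {

    // The mapping as a config like TIKA-866-valid.xml would define it:
    // every type is handled by the EmptyParser.
    static Map<String, String> configuredParsers() {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "EmptyParser");
        parsers.put("text/html", "EmptyParser");
        return parsers;
    }

    // Behaviour described in the report: an unconditional put replaces
    // whatever the config assigned to application/xml.
    static Map<String, String> withHardcodedOverride(Map<String, String> configured) {
        Map<String, String> parsers = new HashMap<>(configured);
        parsers.put("application/xml", "HtmlParser");
        return parsers;
    }

    // Behaviour once the hardcoded line is gone: the configured mapping
    // is used unchanged.
    static Map<String, String> withoutOverride(Map<String, String> configured) {
        return new HashMap<>(configured);
    }

    public static void main(String[] args) {
        Map<String, String> configured = configuredParsers();
        System.out.println("with override:    "
                + withHardcodedOverride(configured).get("application/xml"));
        System.out.println("without override: "
                + withoutOverride(configured).get("application/xml"));
    }
}
```

Because the put happens after the configuration has been applied, no setting in the config can take effect for that media type, which matches the "silently ignored" behaviour reported in the issue.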



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
