Sebastian Hesse created TIKA-1975:
-------------------------------------
Summary: Different behaviour between tika-app and tika-server
Key: TIKA-1975
URL: https://issues.apache.org/jira/browse/TIKA-1975
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.12
Reporter: Sebastian Hesse
I am using tika-server to extract content from Word files from a .NET client.
For the extraction I use the following REST endpoint
(https://wiki.apache.org/tika/TikaJAXRS#Get_the_Text_of_a_Document).
If I extract the content of a DOCX file, the output contains hidden
bookmarks, for example: "[bookmark:_GoBack] hello world"
When I do the same with tika-app from the console, I get "hello world".
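
For anyone trying to reproduce this, here is a minimal sketch of the request
I am describing, written in Python rather than .NET for brevity. It assumes
tika-server is running locally on its default port 9998 and that the text
endpoint is PUT /tika with "Accept: text/plain"; the function names are only
illustrative and not part of any Tika API.

```python
import urllib.request

# Assumption: tika-server runs locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_text_request(docx_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that asks tika-server for plain text."""
    return urllib.request.Request(
        url,
        data=docx_bytes,                   # raw document bytes as the request body
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain-text output
    )


def extract_text(docx_bytes: bytes, url: str = TIKA_URL) -> str:
    """Send the document and return the extracted text (needs a running server)."""
    with urllib.request.urlopen(build_text_request(docx_bytes, url)) as resp:
        return resp.read().decode("utf-8")
```

Calling extract_text() with the bytes of the test file should show the
bookmark marker in the result, while the tika-app comparison would be
something like "java -jar tika-app-1.12.jar --text ContentWord.docx"
(the jar file name is assumed here).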
I didn't find a way to prevent tika-server from extracting the hidden
bookmarks. Specifying the MIME type did not help either.
Here is a test file (only a few characters):
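
Until there is a server-side option, a client-side workaround would be to
strip the markers from the returned text. The following is only a sketch: it
assumes every marker has the form "[bookmark:NAME]" as in the example above,
and strip_bookmarks is a hypothetical helper, not something Tika provides.
It would also remove any genuine document text that happens to match the
same pattern.

```python
import re

# Assumption: markers always look like "[bookmark:NAME]" with no nested brackets.
# One trailing whitespace character is consumed so no stray space is left behind.
_BOOKMARK = re.compile(r"\[bookmark:\s*[^\]]*\]\s?")


def strip_bookmarks(text: str) -> str:
    """Remove bookmark markers from text returned by tika-server."""
    return _BOOKMARK.sub("", text)


print(strip_bookmarks("[bookmark:_GoBack] hello world"))  # prints: hello world
```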
http://en.file-upload.net/download-11584028/ContentWord.docx.html