[ https://issues.apache.org/jira/browse/TIKA-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500168#comment-17500168 ]
Tim Allison commented on TIKA-3684:
-----------------------------------

We could also parameterize the WMF and EMF parsers to turn off text extraction. It _feels_ like wmf+emf used to be used for new information, images, etc within a page, but more recently, I'm seeing it being used as a thumbnail.

Another option for improvement would be to allow configuration of the embedded parser to skip thumbnails. This emf is correctly identified as a thumbnail in its metadata in the /rmeta output. I cannot guarantee that all emf/wmf will be identified as such, though.

> Extract text returns the text multiple times
> --------------------------------------------
>
>                 Key: TIKA-3684
>                 URL: https://issues.apache.org/jira/browse/TIKA-3684
>             Project: Tika
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 2.1.0
>            Reporter: Naama Hophstatder
>            Priority: Major
>         Attachments: example.docx, example.json
>
>
> We are using tika docker container as a linux service, when I want to extract
> text from a word document, e.g.:
> curl -T example.docx http://localhost:9998/tika --header "Accept: text/plain"
> we get the text 3 times.
> Notice: We also have tika server v1.14, and this version returns the text
> just as expected.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)