[ https://issues.apache.org/jira/browse/TIKA-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2606.
-------------------------------
    Resolution: Not A Problem

[~almson], thank you for opening this and sharing a triggering document.

The extra text you're getting is 1) an embedded emf thumbnail of the content and then 2) a wmf inside that emf that also contains your content.

If you only want the container file, you can do something like this:
{noformat}
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, new EmptyParser());{noformat}
Be careful, though, because this will turn off parsing of all embedded files/attachments.

Another option would be to use the RecursiveParserWrapper and iterate through the metadata objects to ignore the ones you don't care about based on their mime type: {{image/emf}} and/or {{image/wmf}}.

Or, you could do something like this:
{noformat}
parseContext.set(DocumentSelector.class, new DocumentSelector() {
    @Override
    public boolean select(Metadata metadata) {
        // constant-first equals avoids an NPE if the content type is missing;
        // do something more robust here as needed
        if ("image/emf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
            return false;
        }
        return true;
    }
});
{noformat}
The limitation of this is that there may be valid attachments stored within an emf.

I hope this helps.

> Tika.parseToString of particular docx results in duplicate text
> ---------------------------------------------------------------
>
>                 Key: TIKA-2606
>                 URL: https://issues.apache.org/jira/browse/TIKA-2606
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>            Reporter: Aleksandr Dubinsky
>            Priority: Major
>         Attachments: TalkingTeachingInquiryforInnovation.docx
>
>
> Attached is a file that is not parsed correctly. Text is duplicated when read
> with Tika.parseToString. In the output, the text of the document appears
> first, then a corrupted copy of the document, then another copy of the
> document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
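
The DocumentSelector suggestion in the reply above leaves the content-type check open ("do something more robust"). Below is a minimal sketch of one way that check might look, written as plain Java with no Tika dependency so it can be run on its own. The class {{EmbeddedTypeFilter}} and its methods {{baseType}} and {{shouldParse}} are hypothetical names invented for this illustration, not part of Tika, and the assumption that a content type may arrive with parameters or mixed case is mine rather than something stated in the thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Tika): decides whether an embedded
 * document should be parsed, given its raw Content-Type value.
 */
class EmbeddedTypeFilter {

    // Types to skip; image/emf and image/wmf are the two named in this thread.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("image/emf", "image/wmf"));

    /** Drops any parameters after ';' and lower-cases the base type; null becomes "". */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns false only for types in SKIPPED; missing or unknown types are parsed. */
    static boolean shouldParse(String contentType) {
        return !SKIPPED.contains(baseType(contentType));
    }

    public static void main(String[] args) {
        System.out.println(shouldParse("image/emf"));
        System.out.println(shouldParse(null));
    }
}
```

Inside the {{select(Metadata metadata)}} method shown in the reply, the body could then reduce to {{return EmbeddedTypeFilter.shouldParse(metadata.get(Metadata.CONTENT_TYPE));}}. The same caveat applies: skipping an emf also skips any valid attachments stored inside it.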