Tim Allison resolved TIKA-2606.
    Resolution: Not A Problem

[~almson], thank you for opening this and sharing a triggering document.  The 
extra text you're getting is 1) an embedded emf thumbnail of the content and 
then 2) a wmf inside that emf that also contains your content.  If you only 
want the container file, you can do something like this:
{noformat}
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, new EmptyParser());
{noformat}

Be careful, though, because this will turn off parsing of all embedded 
files/attachments.  Another option would be to use the RecursiveParserWrapper 
and iterate through the metadata objects to ignore the ones you don't care 
about based on their mime type, {{image/emf}} and/or {{image/wmf}}.  Or, you could do 
something like this:

{noformat}
parseContext.set(DocumentSelector.class, new DocumentSelector() {
    @Override
    public boolean select(Metadata metadata) {
        // Constant-first equals() is null-safe if the content type is missing
        if ("image/emf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
            return false; // skip embedded EMF images
        }
        return true;
    }
});
{noformat}

The limitation of this is that there may be valid attachments stored within an 
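
For the RecursiveParserWrapper option mentioned earlier, the filtering step might look roughly like the sketch below. It is plain Java rather than real Tika code: each Map stands in for one of the metadata objects the wrapper collects, the class and method names are hypothetical, and the keys "Content-Type" and "X-TIKA:content" are assumptions about where the mime type and extracted text are stored. Please check them against the Tika version you are using.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika. Each Map stands in for one metadata
// object from the wrapper's list (container first, then embedded documents).
final class EmbeddedTextFilter {
    // Assumed keys for the mime type and the extracted text.
    static final String TYPE_KEY = "Content-Type";
    static final String CONTENT_KEY = "X-TIKA:content";

    // Mime types whose extracted text should be ignored.
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("image/emf", "image/wmf"));

    static String joinWantedText(List<Map<String, String>> metadataList) {
        List<String> parts = new ArrayList<>();
        for (Map<String, String> m : metadataList) {
            String type = m.get(TYPE_KEY);
            String text = m.get(CONTENT_KEY);
            // Keep any entry that has text and is not a skipped image type;
            // entries with no recorded type are kept rather than dropped.
            if (text != null && (type == null || !SKIPPED.contains(type))) {
                parts.add(text.trim());
            }
        }
        return String.join("\n", parts);
    }
}
```

Unlike the EmptyParser approach, this keeps the text of every other attachment and drops only what came from the emf/wmf thumbnails.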

I hope this helps.

> Tika.parseToString of particular docx results in duplicate text
> ---------------------------------------------------------------
>                 Key: TIKA-2606
>                 URL: https://issues.apache.org/jira/browse/TIKA-2606
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>            Reporter: Aleksandr Dubinsky
>            Priority: Major
>         Attachments: TalkingTeachingInquiryforInnovation.docx
> Attached is a file that is not parsed correctly. Text is duplicated when read 
> with Tika.parseToString. In the output, the text of the document appears 
> first, then a corrupted copy of the document, then another copy of the 
> document.

This message was sent by Atlassian JIRA
