[jira] [Resolved] (TIKA-2606) Tika.parseToString of particular docx results in duplicate text

Tim Allison (JIRA) Tue, 13 Mar 2018 10:49:26 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison resolved TIKA-2606.
-------------------------------
    Resolution: Not A Problem

[~almson], thank you for opening this and sharing a triggering document.  The 
extra text you're getting comes from 1) an embedded emf thumbnail of the 
content and 2) a wmf inside that emf that also contains your content.  If you 
only want the text of the container file, you can do something like this:
{noformat}
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, new EmptyParser());
{noformat}

Be careful, though, because this will turn off parsing of all embedded 
files/attachments.  Another option would be to use the RecursiveParserWrapper 
and iterate through the metadata objects, ignoring the ones you don't care 
about based on their MIME type ({{image/emf}} and/or {{image/wmf}}).  Or, you 
could do something like this:

{noformat}
parseContext.set(DocumentSelector.class, new DocumentSelector() {
    @Override
    public boolean select(Metadata metadata) {
        //constant-first equals() is null-safe if no content type is set;
        //do something more robust (e.g. media type parameters) here
        return !"image/emf".equals(metadata.get(Metadata.CONTENT_TYPE));
    }
});
{noformat}

The limitation of this is that skipping an emf also skips any valid 
attachments stored within it.
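
For either filtering option above, the content-type check probably wants to tolerate nulls, case differences, and media type parameters.  Here is a minimal, library-free sketch of that check.  The class name {{EmbeddedTypeFilter}} and its methods are hypothetical, and plain maps stand in for Tika Metadata objects (keyed by "Content-Type", the value of {{Metadata.CONTENT_TYPE}}); it is not Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: plain maps stand in for Tika Metadata objects.
class EmbeddedTypeFilter {

    // MIME types to ignore, taken from the discussion above.
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("image/emf", "image/wmf"));

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Null-safe: a document with no content type is kept, not dropped.
    static boolean shouldKeep(String contentType) {
        String base = baseType(contentType);
        return base == null || !SKIPPED.contains(base);
    }

    // Keeps only the per-document metadata entries whose type is wanted,
    // as one might do with the list a RecursiveParserWrapper produces.
    static List<Map<String, String>> keepWanted(List<Map<String, String>> docs) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (shouldKeep(doc.get("Content-Type"))) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

Inside a DocumentSelector, the same idea would amount to returning {{shouldKeep(metadata.get(Metadata.CONTENT_TYPE))}} from select().  Whether to keep or drop documents with no content type is a judgment call; this sketch keeps them.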

I hope this helps.

> Tika.parseToString of particular docx results in duplicate text
> ---------------------------------------------------------------
>
>                 Key: TIKA-2606
>                 URL: https://issues.apache.org/jira/browse/TIKA-2606
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>            Reporter: Aleksandr Dubinsky
>            Priority: Major
>         Attachments: TalkingTeachingInquiryforInnovation.docx
>
>
> Attached is a file that is not parsed correctly. Text is duplicated when read 
> with Tika.parseToString. In the output, the text of the document appears 
> first, then a corrupted copy of the document, then another copy of the 
> document.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
