[ https://issues.apache.org/jira/browse/TIKA-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2606.
-------------------------------
Resolution: Not A Problem
[~almson], thank you for opening this and sharing a triggering document. The
extra text you're getting comes from 1) an embedded EMF thumbnail of the
content and 2) a WMF inside that EMF that also contains your content. If you
only want the text of the container file, you can do something like this:
{noformat}
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, new EmptyParser());
{noformat}
Be careful, though, because this will turn off parsing of all embedded
files/attachments. Another option would be to use the RecursiveParserWrapper
and iterate through the metadata objects, ignoring the ones you don't care
about based on their MIME type ({{image/emf}} and/or {{image/wmf}}). Or, you
could do something like this:
{noformat}
parseContext.set(DocumentSelector.class, new DocumentSelector() {
    @Override
    public boolean select(Metadata metadata) {
        // Constant-first equals() avoids an NPE when the content type is
        // missing; do something more robust here as needed.
        // Returning false skips the embedded document.
        return !"image/emf".equals(metadata.get(Metadata.CONTENT_TYPE));
    }
});
{noformat}
The limitation of this approach is that there may be valid attachments stored
within an EMF, and those would be skipped along with it.
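For the RecursiveParserWrapper option, the filtering step might look roughly
like the sketch below. It is only an illustration, not Tika code: a plain
{{Map<String, String>}} stands in for each Tika {{Metadata}} object, the class
and method names ({{EmbeddedImageFilter}}, {{doc}}, {{joinWantedContent}}) are
made up for the example, and it assumes the per-document text is stored under
the {{X-TIKA:content}} key and the type under {{Content-Type}}. In real code
the list would come from the wrapper after parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: Map<String, String> stands in for Tika's Metadata.
class EmbeddedImageFilter {
    // Assumed metadata keys for the MIME type and the extracted text.
    static final String CONTENT_TYPE = "Content-Type";
    static final String CONTENT = "X-TIKA:content";

    // MIME types whose extracted text should be ignored.
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("image/emf", "image/wmf"));

    // Builds a stand-in metadata object; null values are simply left out.
    static Map<String, String> doc(String type, String content) {
        Map<String, String> m = new HashMap<>();
        if (type != null) {
            m.put(CONTENT_TYPE, type);
        }
        if (content != null) {
            m.put(CONTENT, content);
        }
        return m;
    }

    // Concatenates the text of every document whose type is not skipped.
    static String joinWantedContent(List<Map<String, String>> docs) {
        StringBuilder out = new StringBuilder();
        for (Map<String, String> d : docs) {
            String type = d.get(CONTENT_TYPE);
            if (type != null && SKIPPED.contains(type)) {
                continue; // drop the EMF thumbnail and the WMF inside it
            }
            String content = d.get(CONTENT);
            if (content != null) {
                out.append(content);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("application/vnd.openxmlformats-officedocument"
                + ".wordprocessingml.document", "body text\n"));
        docs.add(doc("image/emf", "corrupted copy\n"));
        docs.add(doc("image/wmf", "body text\n"));
        System.out.print(joinWantedContent(docs));
    }
}
```

Unlike the DocumentSelector approach, this filters after parsing, so the text
of other attachments stored inside an EMF is still available in the list.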
I hope this helps.
> Tika.parseToString of particular docx results in duplicate text
> ---------------------------------------------------------------
>
> Key: TIKA-2606
> URL: https://issues.apache.org/jira/browse/TIKA-2606
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Reporter: Aleksandr Dubinsky
> Priority: Major
> Attachments: TalkingTeachingInquiryforInnovation.docx
>
>
> Attached is a file that is not parsed correctly. Text is duplicated when read
> with Tika.parseToString. In the output, the text of the document appears
> first, then a corrupted copy of the document, then another copy of the
> document.