[
https://issues.apache.org/jira/browse/TIKA-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920603#comment-16920603
]
Thomas commented on TIKA-2932:
------------------------------
[[email protected]], thanks a lot :) I understand now that Word does not
always store this information.
I also wanted to know if there is a way to sanitize the text that I am getting
so that it contains only text, with no images, bookmarks, or other data.
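
One possible approach, sketched here as an assumption rather than anything
confirmed in this issue, is to post-process the extracted string and strip the
placeholder markers with a regular expression. The class and method names
below (ExtractedTextSanitizer, sanitize, countWords) are hypothetical and not
part of Tika, and the pattern only covers the two marker shapes visible in the
sample text, "[image: ...]" and "[bookmark: ...]".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: cleans text after extraction.
final class ExtractedTextSanitizer {
    // Assumption: markers look like "[image: ...]" or "[bookmark: ...]",
    // as in the sample text of this issue. The negated class also matches
    // newlines, so a marker wrapped across two lines is still removed.
    private static final Pattern MARKER =
            Pattern.compile("\\[(?:image|bookmark):[^\\]]*\\]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each marker with a space, then collapses runs of whitespace.
    static String sanitize(String extracted) {
        String withoutMarkers = MARKER.matcher(extracted).replaceAll(" ");
        return WHITESPACE.matcher(withoutMarkers).replaceAll(" ").trim();
    }

    // Counts whitespace-separated words in the sanitized text, as a
    // fallback when the document metadata reports "words": 0.
    static int countWords(String extracted) {
        String clean = sanitize(extracted);
        return clean.isEmpty() ? 0 : clean.split(" ").length;
    }

    public static void main(String[] args) {
        String sample = "[image: ] Certified Translation [bookmark: _GoBack]As a translator";
        System.out.println(sanitize(sample));
        System.out.println(countWords(sample));
    }
}
```

Note that this would also remove any genuine document text that happens to
have the same bracketed shape, so the pattern may need tightening for real
input.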
> Filter Documents Meta Data
> --------------------------
>
> Key: TIKA-2932
> URL: https://issues.apache.org/jira/browse/TIKA-2932
> Project: Tika
> Issue Type: Wish
> Components: parser
> Affects Versions: 1.22
> Reporter: Thomas
> Priority: Minor
> Labels: newbie
>
> Hello!
> Is there a way to filter out tags like , *[image: ]* [bookmark]
> from the text I get while parsing the docs? I need this because the
> metadata sometimes does not return the number of words in a document if it
> contains images or tables.
> *MetaData*
> {"title":"Complete
> name,","description":null,"keywords":[],"language":"en","encoding":null,"author":"","generator":"Microsoft
> Office Word","pages":0,"words":0 ...
> *Text*
> [image: ] Certified Translation Certificate of Accuracy Your name here
> Translator/Interpreter Translated document: [bookmark: _GoBack]As a
> translator for Your Spanish Translation, Inc., I, Your name here, declare
> that I am a bilingual translator who is thoroughly familiar with the English
> and source language languages. I have translated the attached document to the
> best of my knowledge from source language into English and the English text
> is an accurate and true translation of the original document presented to the
> best of my knowledge and belief. Signed on June 1, 201 Sign here in blue ink
> Your name here Professional Translator for Day Translations, Inc. [bookmark:
> _MailAutoSig]
> Please help!