[ 
https://issues.apache.org/jira/browse/TIKA-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920603#comment-16920603
 ] 

Thomas commented on TIKA-2932:
------------------------------

[[email protected]], thanks a lot :) I understand now that Word does not 
always store the information.

I also wanted to know if there is a way to sanitize the text that I am getting 
so that it contains only text, with no images, bookmarks, or other data?
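
To make the question concrete, here is a minimal sketch of the kind of clean-up I mean, done as post-processing on the extracted string. It is not a Tika API: the class name TikaTextSanitizer and the methods stripMarkers and countWords are made up for illustration, and the pattern assumes the markers always look like "[image: ...]" or "[bookmark: ...]" as in the sample text quoted below. It would also remove genuine document text that happens to have the same shape.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: cleans text after extraction.
class TikaTextSanitizer {
    // Matches markers such as "[image: ]" or "[bookmark: _GoBack]",
    // including ones that wrap across a line break.
    private static final Pattern MARKER =
            Pattern.compile("\\[(?:image|bookmark):[^\\]]*\\]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes the markers, then collapses runs of whitespace to one space.
    static String stripMarkers(String text) {
        String withoutMarkers = MARKER.matcher(text).replaceAll("");
        return WHITESPACE.matcher(withoutMarkers).replaceAll(" ").trim();
    }

    // Counts whitespace-separated tokens; an empty string has zero words.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : WHITESPACE.split(trimmed).length;
    }

    public static void main(String[] args) {
        String extracted =
                "[image: ] Certified Translation [bookmark: _GoBack]As a translator";
        String cleaned = stripMarkers(extracted);
        System.out.println(cleaned);
        System.out.println(countWords(cleaned));
    }
}
```

Counting words on the cleaned string would also cover the case where the document metadata reports "words":0.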

> Filter Documents Meta Data
> --------------------------
>
>                 Key: TIKA-2932
>                 URL: https://issues.apache.org/jira/browse/TIKA-2932
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.22
>            Reporter: Thomas
>            Priority: Minor
>              Labels: newbie
>
> Hello!
> Is there a way to filter out tags like *[image: ]* and [bookmark] 
> from the text I get while parsing documents? I need this because sometimes the 
> metadata does not return the number of words in a document if it contains 
> images or tables.
> *MetaData*
> {"title":"Complete name,","description":null,"keywords":[],"language":"en","encoding":null,"author":"","generator":"Microsoft Office Word","pages":0,"words":0 ...
> *Text*
> [image: ] Certified Translation Certificate of Accuracy Your name here 
> Translator/Interpreter Translated document: [bookmark: _GoBack]As a 
> translator for Your Spanish Translation, Inc., I, Your name here, declare 
> that I am a bilingual translator who is thoroughly familiar with the English 
> and source language languages. I have translated the attached document to the 
> best of my knowledge from source language into English and the English text 
> is an accurate and true translation of the original document presented to the 
> best of my knowledge and belief. Signed on June 1, 201 Sign here in blue ink 
> Your name here Professional Translator for Day Translations, Inc. [bookmark: 
> _MailAutoSig]
> Please help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
