[ 
https://issues.apache.org/jira/browse/TIKA-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919482#comment-16919482
 ] 

Tim Allison commented on TIKA-2932:
-----------------------------------

Word does not always store that information in the document; in practice it 
rarely does.  I hadn't looked for a correlation with images, but removing 
those tags from the extracted text will not affect the metadata that is 
stored in the document.

If you open the file in Word, you will probably see stats, but those are 
calculated dynamically by the application and may or may not be stored inside 
the document.
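
If the goal is a word count that does not depend on the stored metadata, one 
option is to post-process the extracted text yourself. The following is only 
a minimal sketch in Python, not something Tika provides: the names MARKER, 
strip_markers and count_words are hypothetical, the pattern is an assumption 
based solely on the "[image: ]" and "[bookmark: ...]" markers in the sample 
text in this issue, and a whitespace-based count may differ from the number 
Word itself reports.

```python
import re

# Assumed marker shapes, taken only from the sample text in this issue:
# "[image: ]" and "[bookmark: _GoBack]". Other marker types are not handled.
MARKER = re.compile(r"\[(?:image|bookmark):[^\]]*\]")


def strip_markers(text: str) -> str:
    """Remove image/bookmark markers and collapse runs of whitespace."""
    without_markers = MARKER.sub(" ", text)
    return " ".join(without_markers.split())


def count_words(text: str) -> int:
    """Count whitespace-separated tokens after the markers are removed."""
    return len(strip_markers(text).split())


if __name__ == "__main__":
    sample = "[image: ] Certified Translation [bookmark: _GoBack]As a translator"
    print(strip_markers(sample))
    print(count_words(sample))
```

This runs on whatever string the parser returned, so it leaves the document 
and its stored metadata untouched.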

> Filter Documents Meta Data
> --------------------------
>
>                 Key: TIKA-2932
>                 URL: https://issues.apache.org/jira/browse/TIKA-2932
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.22
>            Reporter: Thomas
>            Priority: Minor
>              Labels: newbie
>
> Hello!
> Is there a way to filter out tags like *[image: ]* and [bookmark] from the 
> text I get while parsing the docs? I need this because sometimes the 
> metadata does not return the number of words in a document if it contains 
> images or tables.
> *MetaData*
> {"title":"Complete name,","description":null,"keywords":[],"language":"en","encoding":null,"author":"","generator":"Microsoft Office Word","pages":0,"words":0 ...
> *Text*
> [image: ] Certified Translation Certificate of Accuracy Your name here 
> Translator/Interpreter Translated document: [bookmark: _GoBack]As a 
> translator for Your Spanish Translation, Inc., I, Your name here, declare 
> that I am a bilingual translator who is thoroughly familiar with the English 
> and source language languages. I have translated the attached document to the 
> best of my knowledge from source language into English and the English text 
> is an accurate and true translation of the original document presented to the 
> best of my knowledge and belief. Signed on June 1, 201 Sign here in blue ink 
> Your name here Professional Translator for Day Translations, Inc. [bookmark: 
> _MailAutoSig]
> Please help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
