Tim Allison commented on TIKA-2077:

In the xml, the AAAAAAA... exists as a paragraph within the text box:
<w:pStyle w:val="FrameContents"/>

Is there any signal that we should skip that section?
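
For anyone who wants to check whether that style marker is a usable signal, here is a rough sketch (not how Tika's docx parser works) that lists the paragraphs in a WordprocessingML fragment whose own w:pPr carries a given w:pStyle value. The class and method names are made up for illustration, and whether "FrameContents" reliably marks content to skip is still the open question above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika.
public class FrameParagraphFinder {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each w:p whose own w:pPr/w:pStyle has the given w:val. */
    public static List<String> paragraphsWithStyle(String xml, String styleId) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<String> hits = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                // Look only at direct children, so a paragraph that merely
                // contains a text box is not matched by the nested style.
                Element pPr = directChild(p, "pPr");
                Element pStyle = (pPr == null) ? null : directChild(pPr, "pStyle");
                if (pStyle != null && styleId.equals(pStyle.getAttributeNS(W_NS, "val"))) {
                    // getTextContent() concatenates all descendant text nodes.
                    hits.add(p.getTextContent());
                }
            }
            return hits;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // First direct child element in the w: namespace with this local name, or null.
    static Element directChild(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }
}
```

Calling paragraphsWithStyle(documentXml, "FrameContents") on the contents of word/document.xml would show which paragraphs carry the style, which could then be compared against what Word drops when saving as text.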

When I save the file as text, the entire text box disappears.  When I save it 
as html, the AAAAA... appears:
 <span style='mso-ignore:vglayout;position:
width=718 height=33 src="TestData_files/image001.gif"
alt="Text Box: TEST:  &#13;&#10;AAAAAAAAAAAAAA&#13;&#10;&#13;&#10;&#13;&#10;"

> Special character extracted as AAAAAAAA in docx file extraction
> ---------------------------------------------------------------
>                 Key: TIKA-2077
>                 URL: https://issues.apache.org/jira/browse/TIKA-2077
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Akash Sudhakar
>         Attachments: TestData.docx
> During docx file extraction using tika 1.13, a special character is extracted
> as AAAAAAAA. How to avoid this?
