[jira] [Comment Edited] (TIKA-2077) Special character extracted as AAAAAAAA in docx file extraction

Tim Allison (JIRA) Tue, 20 Sep 2016 07:14:18 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506438#comment-15506438
 ]


Tim Allison edited comment on TIKA-2077 at 9/20/16 2:12 PM:
------------------------------------------------------------

In the XML, the AAAAAAA... run exists as a paragraph within the text box:
{noformat}
<w:p>
<w:pPr>
<w:pStyle w:val="FrameContents"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
{noformat}

Is there any signal that we should skip that section?
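
One candidate signal is the paragraph style itself. The following is a minimal sketch, not Tika's implementation, of how paragraphs carrying a given `w:pStyle` could be located in WordprocessingML. The wrapping `w:txbxContent` root, the helper name `paragraphs_with_style`, and the choice of Python are assumptions made for illustration; whether `FrameContents` is a reliable reason to skip text is not established here.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the w: prefix in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The paragraph quoted above, wrapped in a root element (assumed here)
# so that the w: prefix is declared and the fragment parses on its own.
SAMPLE = f"""<w:txbxContent xmlns:w="{W_NS}">
<w:p>
<w:pPr>
<w:pStyle w:val="FrameContents"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
</w:txbxContent>"""


def paragraphs_with_style(xml_text, style):
    """Return the text of every w:p whose w:pStyle value equals `style`."""
    root = ET.fromstring(xml_text)
    hits = []
    for p in root.iter(f"{{{W_NS}}}p"):
        p_style = p.find("w:pPr/w:pStyle", NS)
        if p_style is not None and p_style.get(f"{{{W_NS}}}val") == style:
            # Concatenate all text runs inside the matching paragraph.
            hits.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return hits


frame_texts = paragraphs_with_style(SAMPLE, "FrameContents")
print(frame_texts)
```

A filter like this only identifies the styled paragraphs; deciding whether to drop them would still need evidence that the style marks placeholder content rather than real text.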

When I save the file as text, the entire text box disappears.  When I save it 
as HTML, the AAAAA... appears:
{noformat}
 <span style='mso-ignore:vglayout;position:
relative;z-index:251657728;left:-98px;top:47px;width:1795px;height:130px'><img
width=718 height=33 src="TestData_files/image001.gif"
alt="Text Box: TEST:  &#13;&#10;AAAAAAAAAAAAAA&#13;&#10;&#13;&#10;&#13;&#10;"
v:shapes="_x0000_s1026"></span>
{noformat}
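
To make the `alt` text in the HTML export easier to read, the character references can be decoded. This is a small sketch using Python's standard `html` module. The variable names are illustrative, and the string is copied from the snippet above.

```python
import html

# alt attribute value copied from the HTML export quoted above.
ALT = "Text Box: TEST:  &#13;&#10;AAAAAAAAAAAAAA&#13;&#10;&#13;&#10;&#13;&#10;"

# &#13;&#10; is a CR LF pair, so decoding yields ordinary line breaks.
decoded = html.unescape(ALT)

# Keep only the non-empty lines, trimmed of surrounding whitespace.
lines = [ln.strip() for ln in decoded.splitlines() if ln.strip()]
print(lines)
```

Decoded this way, the text box holds a `TEST:` line followed by the run of A characters, then empty lines.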


was (Author: talli...@mitre.org):
In the xml, there AAAAAAA... exists as a paragraph within the text box:
{noformat}
<w:p>
<w:pPr>
<w:pStyle w:val="FrameContents"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
{noformat}

Is there any signal that we should skip that section?

When I save the file as text, the entire text box disappears.  When I save it 
as html, the AAAAA... appears:
{noformat}
 <span style='mso-ignore:vglayout;position:
relative;z-index:251657728;left:-98px;top:47px;width:1795px;height:130px'><img
width=718 height=33 src="TestData_files/image001.gif"
alt="Text Box: TEST:  &#13;&#10;AAAAAAAAAAAAAA&#13;&#10;&#13;&#10;&#13;&#10;"
v:shapes="_x0000_s1026"></span>
{noformat}

> Special character extracted as AAAAAAAA in docx file extraction
> ---------------------------------------------------------------
>
>                 Key: TIKA-2077
>                 URL: https://issues.apache.org/jira/browse/TIKA-2077
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Akash Sudhakar
>         Attachments: TestData.docx
>
>
> During docx file extraction using Tika 1.13, a special character is extracted 
> as AAAAAAAA.
> How can this be avoided?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
