[
https://issues.apache.org/jira/browse/TIKA-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506438#comment-15506438
]
Tim Allison edited comment on TIKA-2077 at 9/20/16 2:12 PM:
------------------------------------------------------------
In the XML, the AAAAAAA... string exists as a paragraph within the text box:
{noformat}
<w:p>
<w:pPr>
<w:pStyle w:val="FrameContents"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
{noformat}
Is there any signal that we should skip that section?
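One way to look for such a signal is to list paragraphs by their style and see what else is tagged the same way. The following is a minimal diagnostic sketch in Python, not Tika code. It assumes the paragraph above is wrapped in a root element that declares the standard WordprocessingML namespace for the w: prefix; the helper name paragraphs_with_style is hypothetical. Whether "FrameContents" is a reliable marker for text that should be skipped is an open question, not something this snippet establishes.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML main namespace bound to the w: prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# The paragraph quoted above, wrapped in a root element (an assumption made
# here) so that the w: prefix is declared and the fragment parses on its own.
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
<w:p>
<w:pPr>
<w:pStyle w:val="FrameContents"/>
<w:rPr></w:rPr>
</w:pPr>
<w:r><w:rPr></w:rPr><w:t>AAAAAAAAAAAAAA</w:t></w:r>
</w:p>
</w:body>"""


def paragraphs_with_style(xml_text, style):
    """Return the text of every w:p whose w:pStyle value equals `style`."""
    root = ET.fromstring(xml_text)
    hits = []
    for p in root.iter(f"{{{W_NS}}}p"):
        p_style = p.find("w:pPr/w:pStyle", NS)
        if p_style is not None and p_style.get(f"{{{W_NS}}}val") == style:
            # Concatenate all w:t runs inside the matching paragraph.
            hits.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return hits


print(paragraphs_with_style(SAMPLE, "FrameContents"))
```

Running the same function over word/document.xml from the attached file would show whether any other paragraphs carry the same style.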
When I save the file as text, the entire text box disappears. When I save it
as HTML, the AAAAA... appears, but only in the alt text of an image:
{noformat}
<span style='mso-ignore:vglayout;position:
relative;z-index:251657728;left:-98px;top:47px;width:1795px;height:130px'><img
width=718 height=33 src="TestData_files/image001.gif"
alt="Text Box: TEST: AAAAAAAAAAAAAA "
v:shapes="_x0000_s1026"></span>
{noformat}
> Special character extracted as AAAAAAAA in docx file extraction
> ---------------------------------------------------------------
>
> Key: TIKA-2077
> URL: https://issues.apache.org/jira/browse/TIKA-2077
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.13
> Reporter: Akash Sudhakar
> Attachments: TestData.docx
>
>
> During docx file extraction using Tika 1.13, a special character is extracted
> as AAAAAAAA.
> How can this be avoided?