[
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1130:
------------------------------
Attachment: TIKA-1130.patch
Many thanks to Ray for the unit test and to Nick for his guidance on the POI
patch and this Tika patch.
This is the first-round patch for Tika to make use of the new SDT (structured
document tag) processing in POI.
Ray's test case brought to light a formatting issue in POI 54849: we don't
want to insert a "\n" between two runs within an SDT. I'll submit a patch for
this in POI.
Let me know how this looks.
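To make the run-joining point concrete, here is a minimal JDK-only sketch of the intended behaviour: text from consecutive runs inside one SDT is concatenated with no "\n" in between. This is not the POI or Tika code; the class and method names (SdtRunText, joinRuns, sdtText) are made up for illustration, and it assumes SDTs are not nested.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class SdtRunText {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Joins the text of every w:t below the element, with no separator
    // between runs.
    static String joinRuns(Element element) {
        StringBuilder text = new StringBuilder();
        NodeList nodes = element.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < nodes.getLength(); i++) {
            text.append(nodes.item(i).getTextContent());
        }
        return text.toString();
    }

    // Returns the text of each w:sdt, one SDT per line.
    // Assumption: SDTs are not nested; a nested SDT would be reported twice.
    static String sdtText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList sdts = doc.getElementsByTagNameNS(W_NS, "sdt");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < sdts.getLength(); i++) {
                if (i > 0) {
                    out.append('\n');
                }
                out.append(joinRuns((Element) sdts.item(i)));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\"><w:sdt><w:sdtContent>"
                + "<w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r>"
                + "</w:sdtContent></w:sdt></w:p>";
        System.out.println(sdtText(xml)); // prints "Hello world"
    }
}
```

The two runs come out as a single line; a newline is only emitted between separate SDTs.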
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
> Key: TIKA-1130
> URL: https://issues.apache.org/jira/browse/TIKA-1130
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2, 1.3
> Environment: OpenJDK x86_64
> Reporter: Daniel Gibby
> Priority: Critical
> Attachments: Resume 6.4.13.docx, TIKA-1130.patch
>
>
> When parsing a Microsoft Word .docx
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document),
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions
> of text are not extracted, while the darker colored text extracts fine.
> Looking at the document.xml portion of the .docx zip file shows the text is
> all there.
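One way to repeat the check described above without going through Tika is to read word/document.xml straight out of the zip container and print the text of every w:t element, including any that sit inside w:sdt content. The sketch below uses only the JDK; the class name DocxRawText and its methods are hypothetical, and it is a rough diagnostic rather than a faithful text extractor (for example, text boxes nested inside a paragraph may show up more than once).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; not part of Tika or POI.
final class DocxRawText {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml in the .docx zip and returns its text,
    // one paragraph per line.
    static String rawText(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null;
                    entry = zip.getNextEntry()) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes stops at the end of the current entry.
                    return paragraphs(zip.readAllBytes());
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Concatenates the w:t text of each w:p, wherever the paragraph sits
    // in the tree (body, table cell, or SDT content).
    static String paragraphs(byte[] documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml));
            NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < paras.getLength(); i++) {
                NodeList texts = ((Element) paras.item(i))
                        .getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    out.append(texts.item(j).getTextContent());
                }
                out.append('\n');
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document.xml", e);
        }
    }

    // Usage: java DocxRawText path/to/file.docx
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(rawText(in));
        }
    }
}
```

Comparing this output with what the parser returns for the same file shows which paragraphs are being dropped.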