[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704693#comment-13704693
 ] 

Nick Burch commented on TIKA-1130:
----------------------------------

Daniel - the simplest way to check would be for you to do an svn checkout of 
Tika, build a snapshot of the Tika App, and try with that. If your problem goes 
away, you know it was this issue and that it is now fixed. If it still remains, 
you'll probably want to open a fresh bug and also try to identify what kind of 
text it is that Tika ignores.
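
To help narrow down which text is being skipped, one option is to read 
word/document.xml straight out of the .docx zip and compare it with what Tika 
returned. The following is only a rough sketch, not part of Tika: the function 
names docx_paragraphs and missing_from_extract are made up for illustration, 
and it assumes the extracted text has already been saved as a string.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the text of every non-empty w:p element in word/document.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    # iter() walks the whole tree, so paragraphs nested inside other
    # containers are found as well as top-level ones.
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def missing_from_extract(path, extracted_text):
    """List paragraphs present in the .docx but absent from extracted_text.

    Hypothetical helper: whitespace is collapsed on both sides before the
    substring check, so line wrapping in the extract does not matter.
    """
    normalized = " ".join(extracted_text.split())
    return [
        p for p in docx_paragraphs(path)
        if " ".join(p.split()) not in normalized
    ]
```

Calling missing_from_extract("OwenResume.docx", tika_text), where tika_text 
holds the Tika output, would list the paragraphs that did not make it through. 
A paragraph that contains another paragraph (a text box, for example) is 
reported with the nested text joined in, so treat the result as a hint about 
where to look rather than an exact diff.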
                
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
>                 Key: TIKA-1130
>                 URL: https://issues.apache.org/jira/browse/TIKA-1130
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: OpenJDK x86_64
>            Reporter: Daniel Gibby
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: OwenResume.docx, Resume 6.4.13.docx, tee internal 
> resme.docx, TIKA-1130.patch, TIKA-1130.patch
>
>
> When parsing a Microsoft Word .docx 
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions 
> of text are what are not extracted, while the darker colored text extracts 
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is 
> all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
