[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

Daniel Gibby (JIRA) Mon, 24 Jun 2013 08:30:04 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692077#comment-13692077 ]


Daniel Gibby commented on TIKA-1130:
------------------------------------

I'd rather wait a week for a beta, but I need to test this against our code 
sooner than that. I'm under pressure to take just the patch for this one bug 
and apply it to our production code. 

I did get the jars built, but I'm not sure where in the Tika project structure 
the POI jars get imported from. Tika builds with Maven, which I don't know how 
to configure, and I'm not familiar with the Tika project structure either. Is 
there a place I can drop the POI jar file where Maven will recognize it, or is 
there an environment variable or PATH I should set somewhere?

I've noticed that tika-parsers/pom.xml mentions POI, but only in a comment and 
as version numbers in a few properties. I also see the POI jars mentioned in 
target/classes/META-INF/DEPENDENCIES, but those entries are also just version 
numbers.

I see the POIContainerExtractionTest and ooxml packages in 
org/apache/tika/parser/microsoft, but those are just the tests.

Where does the POI jar go, and what needs to be configured before running mvn 
to build Tika?
My guess is that my ignorance of how Maven works is why I'm not sure what to 
do. Coming from Ant, I'm used to putting a jar in the right spot and possibly 
changing a build.xml property so the build looks in the correct place. What am 
I missing?
                
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
>                 Key: TIKA-1130
>                 URL: https://issues.apache.org/jira/browse/TIKA-1130
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: OpenJDK x86_64
>            Reporter: Daniel Gibby
>            Priority: Critical
>         Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx 
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions 
> of text are what are not extracted, while the darker colored text extracts 
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is 
> all there.
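
The check described in the issue (looking inside the document.xml part of the
.docx zip) can be sketched roughly as follows. This is only an illustration,
assuming Python is at hand; docx_raw_text is a hypothetical helper name and is
not part of Tika or POI. It lists the text of every w:t run so the result can
be compared by hand against what Tika extracts.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the <w:t> text-run elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_raw_text(path):
    """Return the text of every <w:t> element in word/document.xml, in document order."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # iter() walks the whole tree, so runs nested inside other elements are included.
    return [node.text or "" for node in root.iter("{%s}t" % W_NS)]
```

Running docx_raw_text on the attached file and searching the result for one of
the 'gray' phrases would confirm that the text is present in the raw XML even
when it is missing from the extracted output.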

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
