[
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692077#comment-13692077
]
Daniel Gibby commented on TIKA-1130:
------------------------------------
I'd rather wait a week for a beta, but I need to test this on our code sooner
than that. I'm getting pressure to take only the patch for this one bug and
apply it to our production code.
I did get the jars built, but I'm not sure where in the Tika project structure
the POI jars get imported from. Tika builds with Maven, which I don't know how
to configure, and I'm not familiar with the Tika project structure either. Is
there a place I can drop the POI jar file so that Maven will recognize it, or
is there an environment variable or PATH I should set somewhere?
I've noticed that tika-parsers/pom.xml mentions POI, but only in a comment and
in the version numbers of a few properties. I also see the POI jars mentioned
in target/classes/META-INF/DEPENDENCIES, but those entries are also just
version numbers.
I see the POIContainerExtractionTest and ooxml packages in
org/apache/tika/parser/microsoft, but those are just the tests.
Where does the POI jar go, and what needs to be configured before running mvn
to build Tika?
My guess is that my ignorance of how Maven works is why I'm not sure what to
do. Coming from Ant, I'm used to putting a jar in the right spot and possibly
changing a build.xml property so that the build looks in the correct place.
What am I missing?
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
> Key: TIKA-1130
> URL: https://issues.apache.org/jira/browse/TIKA-1130
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2, 1.3
> Environment: OpenJDK x86_64
> Reporter: Daniel Gibby
> Priority: Critical
> Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document),
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions
> of text are the ones that are not extracted, while the darker text extracts
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is
> all there.