Robert Kaulbach created TIKA-3157:
-------------------------------------
Summary: Missing content from .docx file with hyperlinked shape
Key: TIKA-3157
URL: https://issues.apache.org/jira/browse/TIKA-3157
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.24.1
Reporter: Robert Kaulbach
The attached .docx file was created in MS Office: I drew a rectangle and then
added a hyperlink to it. The hyperlink does not show in LibreOffice, but it is
still present and clickable when the file is opened in MS Office.
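For anyone who wants to confirm that the link is really stored in the file, here is a minimal sketch using only the JDK. It assumes the hyperlink target is recorded as a relationship in word/_rels/document.xml.rels, which is where OOXML normally keeps external link targets. The class and method names (DocxHyperlinkTargets, fromRels, fromDocx) are made up for illustration and are not part of Tika.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists hyperlink relationships in a .docx.
class DocxHyperlinkTargets {

    // Reads a relationships part and returns "Id -> Target" for each hyperlink entry.
    static List<String> fromRels(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList rels = factory.newDocumentBuilder()
                .parse(relsXml)
                .getElementsByTagNameNS("*", "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            // Matching on the suffix is an assumption meant to cover both the
            // transitional and the strict OOXML relationship type URIs.
            if (rel.getAttribute("Type").endsWith("/hyperlink")) {
                targets.add(rel.getAttribute("Id") + " -> " + rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    // Opens the .docx as a zip archive and inspects the main document's relationships.
    static List<String> fromDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/_rels/document.xml.rels");
            if (entry == null) {
                return new ArrayList<>();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return fromRels(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to the .docx file as the first argument.
        for (String line : fromDocx(args[0])) {
            System.out.println(line);
        }
    }
}
```

If the shape's link shows up here but not in the Tika output, the target is in the package and is being dropped somewhere during parsing.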
When parsing with Tika, the hyperlink attached to the shape does not appear
anywhere in the output. Enabling all Office/OOXML parse options in the context
has not helped.
When debugging, I can see that the linked shape is skipped in the startElement
method of
org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java
because "inACChoiceDepth" is greater than 0.
For my use case I'd like to extract as much information as possible from the
document. It would be helpful if the parser config could either disable this
check on "inACChoiceDepth" or increase the allowed limit before skipping
content.
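To illustrate the behaviour and the requested option, here is a simplified, self-contained sketch of a SAX handler that skips everything inside mc:Choice using a depth counter. This is not Tika's actual code: the skipChoice flag is a hypothetical setting of the kind requested above, and the class name, method names and sample XML are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Simplified illustration only; not the real OOXMLWordAndPowerPointTextHandler.
class ChoiceSkipSketch {
    static final String MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Reduced sample: a shape hyperlink nested inside mc:AlternateContent/mc:Choice.
    static final String SAMPLE =
        "<doc xmlns:mc='" + MC_NS + "' xmlns:r='" + R_NS + "'"
        + " xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main'>"
        + "<mc:AlternateContent><mc:Choice Requires='wps'>"
        + "<a:hlinkClick r:id='rId4'/>"
        + "</mc:Choice><mc:Fallback/></mc:AlternateContent></doc>";

    static class Handler extends DefaultHandler {
        private final boolean skipChoice;   // hypothetical config flag
        private int inACChoiceDepth = 0;    // > 0 while inside mc:Choice
        final List<String> linkIds = new ArrayList<>();

        Handler(boolean skipChoice) {
            this.skipChoice = skipChoice;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth++;
                return;
            }
            if (skipChoice && inACChoiceDepth > 0) {
                return; // content inside mc:Choice is dropped here
            }
            if ("hlinkClick".equals(localName)) {
                linkIds.add(atts.getValue(R_NS, "id"));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (MC_NS.equals(uri) && "Choice".equals(localName)) {
                inACChoiceDepth--;
            }
        }
    }

    // Parses the XML and returns the relationship ids of the hyperlinks that were seen.
    static List<String> linkIds(String xml, boolean skipChoice) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Handler handler = new Handler(skipChoice);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.linkIds;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("skipChoice=true:  " + linkIds(SAMPLE, true));
        System.out.println("skipChoice=false: " + linkIds(SAMPLE, false));
    }
}
```

With the sample fragment, the hyperlink id is lost when skipChoice is true and reported when it is false. A real option would also have to decide how to avoid emitting both the mc:Choice and the mc:Fallback content, which this sketch does not attempt.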