[
https://issues.apache.org/jira/browse/TIKA-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated TIKA-989:
------------------------------------
Attachment: TIKA-989.patch
Patch with test, adding placeholders for embedded files inside Word
(.docx) files. The output is just like TIKA-XXX (<div
class="embedded" id="XXX"/>), but the XXX is now an rIdN (relationship
ID), and the filename used when recursing into each embedded file
starts with the relationship ID.
It's rather hackity how I dig into the XML looking for the embedded
object ... if someone familiar with XPath/XQuery could fix it up that
would be nice!
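
For anyone picking this up, here is a rough sketch of what an XPath-based lookup might look like. It is not the code in the attached patch. It assumes that embedded objects show up in word/document.xml as o:OLEObject elements carrying an r:id attribute, and the class and method names (EmbeddedIdFinder, findEmbeddedIds) are made up for illustration. It uses only the JDK's javax.xml.xpath API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika or of the attached patch.
final class EmbeddedIdFinder {
    // Namespace URIs used by the o: and r: prefixes in OOXML documents.
    static final String O_NS = "urn:schemas-microsoft-com:office:office";
    static final String R_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Returns the relationship IDs (rIdN) of every o:OLEObject in the given XML.
    // Assumption: embedded objects are marked up as <o:OLEObject r:id="rIdN"/>.
    static List<String> findEmbeddedIds(String documentXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("o".equals(prefix)) return O_NS;
                    if ("r".equals(prefix)) return R_NS;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            // Select the r:id attribute of each OLEObject, wherever it is nested.
            NodeList attrs = (NodeList) xpath.evaluate(
                "//o:OLEObject/@r:id", doc, XPathConstants.NODESET);

            List<String> ids = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                ids.add(attrs.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            // Parsing or XPath failure; wrapped so callers need no checked handling.
            throw new IllegalStateException("Could not scan document XML", e);
        }
    }
}
```

Each returned ID could then be used as the id of the <div class="embedded"/> placeholder and as the prefix of the embedded file's name, as described above.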
> We don't extract a placeholder for documents embedded in a Word OOXML (.docx)
> document
> --------------------------------------------------------------------------------------
>
> Key: TIKA-989
> URL: https://issues.apache.org/jira/browse/TIKA-989
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 1.3
>
> Attachments: TIKA-989.patch
>
>
> In TIKA-956 we fixed the Word parser so that at the point where an embedded
> document appears, we output a <div class="embedded" id="_XXX"/> tag.
> It would be nice to do this for documents embedded in OOXML documents too.