[ https://issues.apache.org/jira/browse/TIKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684932#comment-17684932 ]
Tim Allison commented on TIKA-3968:
-----------------------------------
Sorry, to be very specific: {{oleObject1.bin}} does contain the name of the text
file and its full original file path...yay! {{oleObject2.bin}} does not appear
to contain the name of the PDF file. The embedded docx file is renamed to
{{Microsoft_Word_Document.docx}}, and there doesn't seem to be any way to
reconstruct the original name from the embedded docx's own metadata.
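
For anyone who wants to check an OLE object by hand, here is a rough sketch of the kind of inspection described above. It is not Tika code and does not parse the OLE2 container. It only scans the raw bytes for printable ASCII runs and keeps those that end in a known extension. The function names and the extension list are made up for illustration, and names stored as UTF-16 would be missed by this approach.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def filename_candidates(data: bytes,
                        extensions=(".txt", ".pdf", ".docx")) -> list[str]:
    """Keep only the runs that look like file names or paths (hypothetical filter)."""
    return [s for s in printable_runs(data) if s.lower().endswith(extensions)]
```

Running {{filename_candidates(open("oleObject1.bin", "rb").read())}} on a local copy of the attachment is one quick way to see whether a name or path survives in the binary.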
> Reconstruct embedded file names from recent docx files
> ------------------------------------------------------
>
> Key: TIKA-3968
> URL: https://issues.apache.org/jira/browse/TIKA-3968
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: Microsoft_Word_Document.docx, image1.emf, image2.emf,
> image3.emf, oleObject1.bin, oleObject2.bin, testWORD has attachment.docx
>
>
> I'm starting to hear from several users, in private communications, that
> Microsoft has changed its basic behavior for files attached to at least
> docx files (possibly pptx and xlsx?). Rather than storing the original file
> name, the file associates an EMF file with an attachment. The filename that
> a human sees in the application is spelled/painted out in the EMF file, but
> does NOT exist in any of the XML.
> I'm attaching an example file.
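
To see which parts a docx like the attached example carries, a minimal sketch along these lines may help. It assumes the usual OOXML layout, where Word places embedded objects under {{word/embeddings/}} and their preview images under {{word/media/}}. The helper name is hypothetical, and the sketch only lists zip entries; it does not follow the relationship files that tie each EMF to its attachment.

```python
import zipfile


def list_embedded_parts(docx) -> dict[str, list[str]]:
    """List zip entries under the conventional embedding and media folders.

    docx may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        names = zf.namelist()
    return {
        "embeddings": sorted(n for n in names if n.startswith("word/embeddings/")),
        "media": sorted(n for n in names if n.startswith("word/media/")),
    }
```

Called on a local copy of {{testWORD has attachment.docx}}, this should show the {{oleObject*.bin}} and {{image*.emf}} entries side by side, which makes the missing file names easy to spot.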