[
https://issues.apache.org/jira/browse/TIKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685447#comment-17685447
]
ASF GitHub Bot commented on TIKA-3968:
--------------------------------------
tballison merged PR #948:
URL: https://github.com/apache/tika/pull/948
> Reconstruct embedded file names from associated emf files within docx files
> ---------------------------------------------------------------------------
>
> Key: TIKA-3968
> URL: https://issues.apache.org/jira/browse/TIKA-3968
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: Microsoft_Word_Document.docx,
> image-2023-02-06-15-46-05-678.png, image-2023-02-06-15-58-20-443.png,
> image1-1.emf, image1-2.emf, image1.emf, image2.emf, image3.emf,
> oleObject1.bin, oleObject2.bin, testWORD has attachment.docx
>
>
> Several users have told me privately that Microsoft -has changed their basic
> behavior- for files attached to at least docx files (possibly pptx and xlsx
> as well?). Rather than storing the original file name, the document
> associates an EMF file with each attachment. The file name that a human sees
> in the application is painted (spelled out) in the EMF file, but does NOT
> exist in any of the XML.
> I'm attaching an example file.
> While fixing this issue, I noticed that some of our fairly old docx files use
> this technique. It is not clear that this is a new thing; I just happen to be
> hearing about it from several people.
> I'd like to thank Chetan Bikire ([~chetab]) for raising this issue and
> sharing the example document which we've added to our unit tests.
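
To illustrate the mechanism described above (the visible file name exists only
as text drawn inside the EMF), here is a minimal Python sketch that walks the
records of an EMF file and collects the strings drawn by EMR_EXTTEXTOUTW
records. It is not the code merged in PR #948. The function name
emf_text_runs is hypothetical, and the sketch assumes the label is drawn with
EMR_EXTTEXTOUTW (record type 84, UTF-16LE text); a label may be split across
several text records, and other text record types are ignored.

```python
import struct

EMR_EOF = 14
EMR_EXTTEXTOUTW = 84  # EMF record type that draws UTF-16LE text


def emf_text_runs(data: bytes) -> list[str]:
    """Return the strings drawn by EMR_EXTTEXTOUTW records, in file order.

    Hypothetical helper for illustration only.
    """
    runs = []
    pos = 0
    while pos + 8 <= len(data):
        # Every EMF record starts with its type and total size in bytes.
        rec_type, rec_size = struct.unpack_from("<II", data, pos)
        if rec_size < 8 or pos + rec_size > len(data):
            break  # malformed or truncated record; stop rather than guess
        if rec_type == EMR_EXTTEXTOUTW and rec_size >= 52:
            # After the 8-byte header: Bounds (16 bytes), iGraphicsMode,
            # exScale, eyScale (4 bytes each), then the EmrText object:
            # Reference (8), Chars (4), offString (4), ...
            # offString is measured from the start of the record.
            n_chars, off_string = struct.unpack_from("<II", data, pos + 44)
            end = off_string + 2 * n_chars
            if end <= rec_size:
                raw = data[pos + off_string : pos + end]
                runs.append(raw.decode("utf-16-le", errors="replace"))
        if rec_type == EMR_EOF:
            break
        pos += rec_size
    return runs
```

Run against the bytes of one of the attached EMF files (for example
image1.emf), the returned runs would be candidates for the displayed file
name. Which runs to join, and in what order, is a heuristic that this sketch
leaves open.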