[ 
https://issues.apache.org/jira/browse/TIKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3968:
------------------------------
    Description: 
I'm starting to see among several users communicating with me privately that 
Microsoft -has changed their basic behavior- for files attached to at least 
docx files (possibly pptx and xlsx?).  Rather than storing the original file 
name, the file associates an EMF file with an attachment.  The filename that a 
human sees in the application is spelled/painted out in the EMF file, but does 
NOT exist in any of the XML.

I'm attaching an example file.

In fixing this issue, I've noticed that some of our fairly old docx files use 
this technique.  It is not clear that this is new; I just happen to be hearing 
about it from several people.

  was:
I'm starting to see among several users communicating with me privately that 
Microsoft has changed their basic behavior for files attached to at least docx 
files (possibly pptx and xlsx?).  Rather than storing the original file name, 
the file associates an EMF file with an attachment.  The filename that a human 
sees in the application is spelled/painted out in the EMF file, but does NOT 
exist in any of the XML.

I'm attaching an example file.


> Reconstruct embedded file names from associated emf files within docx files
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-3968
>                 URL: https://issues.apache.org/jira/browse/TIKA-3968
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: Microsoft_Word_Document.docx, 
> image-2023-02-06-15-46-05-678.png, image-2023-02-06-15-58-20-443.png, 
> image1-1.emf, image1-2.emf, image1.emf, image2.emf, image3.emf, 
> oleObject1.bin, oleObject2.bin, testWORD has attachment.docx
>
>
> I'm starting to see among several users communicating with me privately that 
> Microsoft -has changed their basic behavior- for files attached to at least 
> docx files (possibly pptx and xlsx?).  Rather than storing the original file 
> name, the file associates an EMF file with an attachment.  The filename that 
> a human sees in the application is spelled/painted out in the EMF file, but 
> does NOT exist in any of the XML.
>
> I'm attaching an example file.
>
> In fixing this issue, I've noticed that some of our fairly old docx files use 
> this technique.  It is not clear that this is new; I just happen to be hearing 
> about it from several people.
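
For anyone who wants to check their own files for this pattern, here is a rough sketch (not Tika code and not the fix for this issue) that lists the EMF parts and embedded-object parts of a docx using only java.util.zip. The class name DocxEmfLister and the method listCandidates are made up for illustration, and the "word/embeddings/" prefix is an assumption based on the usual docx package layout; real files may place parts elsewhere. It only lists part names; it does not read the file name painted inside the EMF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of Tika.
public class DocxEmfLister {

    /**
     * Returns the names of ZIP entries that look like EMF images or
     * embedded objects, in the order they appear in the package.
     */
    public static List<String> listCandidates(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: embedded objects live under word/embeddings/.
                boolean isEmf = name.toLowerCase(Locale.ROOT).endsWith(".emf");
                boolean isEmbedded = name.startsWith("word/embeddings/");
                if (isEmf || isEmbedded) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java DocxEmfLister <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            listCandidates(in).forEach(System.out::println);
        }
    }
}
```

If a file shows an oleObject*.bin part alongside one or more .emf parts, it is a candidate for the behavior described above.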



--
This message was sent by Atlassian Jira
(v8.20.10#820010)