[ 
https://issues.apache.org/jira/browse/TIKA-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ChetanB updated TIKA-3968:
--------------------------
    Attachment: Inner Test Email.msg
                symbol.docx

Hi Tim,

How are you?
I recently observed the following while processing an email file:
1) If the file includes a symbol like abc, Tika translates the symbol to
*abc* instead of returning it as is.
2) For .msg email files, *Message:From-Email* is extracted incorrectly.
3) For .msg email files, the language (e.g. en-US) is not returned as part
of the parsing result.

Please find the attached example documents for these scenarios.
Sorry for sharing these files with you directly; I am not able to share
them publicly.


Thanks,
Chetan




> Reconstruct embedded file names from associated emf files within docx files
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-3968
>                 URL: https://issues.apache.org/jira/browse/TIKA-3968
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.7.1
>
>         Attachments: Inner Test Email.msg, Microsoft_Word_Document.docx, 
> image-2023-02-06-15-46-05-678.png, image-2023-02-06-15-58-20-443.png, 
> image1-1.emf, image1-2.emf, image1.emf, image2.emf, image3.emf, 
> oleObject1.bin, oleObject2.bin, symbol.docx, testWORD has attachment.docx
>
>
> I'm starting to see among several users communicating with me privately that 
> Microsoft -has changed their basic behavior- for files attached to at least 
> docx files (possibly pptx and xlsx?).  Rather than storing the original file 
> name, the file associates an EMF file with an attachment.  The filename that 
> a human sees in the application is spelled/painted out in the EMF file, but 
> does NOT exist in any of the XML.
> I'm attaching an example file.
> In fixing this issue, I've noticed that some of our fairly old docx files use 
> this technique.  Not clear that it is a new thing, just happen to be hearing 
> about it from several people.
> I'd like to thank Chetan Bikire ([~chetab]) for raising this issue and 
> sharing the example document which we've added to our unit tests.
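
To make the layout described above concrete, here is a minimal sketch (not Tika's actual fix) that only locates the EMF previews and embedded objects inside a docx package. It assumes the common layout where previews live under word/media/ and embedded objects under word/embeddings/; the function names and the sample package are hypothetical, and recovering the painted file name from the EMF records is not attempted here.

```python
import io
import zipfile


def list_emf_and_embeddings(docx_bytes):
    """Return (emf_parts, embedded_parts) found inside a docx package.

    Assumes the usual part layout: EMF previews under word/media/ and
    embedded objects under word/embeddings/.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        names = zf.namelist()
    emf_parts = sorted(
        n for n in names
        if n.startswith("word/media/") and n.lower().endswith(".emf")
    )
    embedded_parts = sorted(
        n for n in names if n.startswith("word/embeddings/")
    )
    return emf_parts, embedded_parts


def make_sample_package():
    """Build a tiny in-memory zip that mimics the part names of a docx.

    Hypothetical stand-in for a real document; the parts are empty.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.emf", b"")
        zf.writestr("word/embeddings/oleObject1.bin", b"")
    return buf.getvalue()


emf_parts, embedded_parts = list_emf_and_embeddings(make_sample_package())
print(emf_parts)       # EMF previews that may carry the visible file name
print(embedded_parts)  # the embedded objects themselves
```

Pairing each EMF part with its embedded object would additionally require reading the relationship ids in the document XML, which this sketch leaves out.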



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
