[jira] [Updated] (TIKA-3388) Ole10Native attachments with non-ASCII filenames extracted with garbled names

Tim Allison (Jira) Mon, 06 Feb 2023 14:42:04 -0800


     [ 
https://issues.apache.org/jira/browse/TIKA-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-3388:
------------------------------
    Fix Version/s: 2.7.0

> Ole10Native attachments with non-ASCII filenames extracted with garbled names
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-3388
>                 URL: https://issues.apache.org/jira/browse/TIKA-3388
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.26
>            Reporter: Ross Johnson
>            Priority: Minor
>             Fix For: 2.7.0
>
>         Attachments: Ole10Native att with Unicode name.docx
>
>
> I've encountered some Word files that have Ole10Native embedded objects which 
> Tika extracts with strange filenames. It looks like the attachments were 
> originally named with Chinese and other Unicode characters, and the filename 
> that Tika returns is a cp1252 interpretation of the original UTF-8-encoded 
> filename.
> Looking closer at the Ole10Native stream of these files, it does seem like 
> there is a UTF-8 version of the filename stored, as well as a UTF-16 version 
> of the filename stored later on after the actual attachment data. I believe 
> POI is returning this first UTF-8 version of the filename interpreted as if 
> it were ANSI / cp1252.
> A possible solution would be for Apache POI to read and return the provided 
> UTF-16 filename if it is present. Alternatively, Tika could check the 
> currently returned "ANSI" name to see if it might actually be valid UTF-8.
> Attached is a sample file I made which has a .msg file with name 
> "約翰的測試文件🖖.msg" embedded in a .docx file. Tika currently extracts the 
> attachment with filename "ç´ç¿°çæ¸¬è©¦æä»¶ð.msg"
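For illustration, the "check whether the ANSI name is actually valid UTF-8" idea can be sketched in a few lines of Python. This is not Tika or POI code; the function name `repair_mojibake` is hypothetical, and the sketch assumes the garbled name was produced by decoding UTF-8 bytes as cp1252.

```python
def repair_mojibake(name: str) -> str:
    """Return the UTF-8 reading of `name` if it looks like UTF-8 decoded as cp1252.

    Falls back to the unchanged name when the round trip is not possible.
    """
    try:
        raw = name.encode("cp1252")    # recover the bytes as originally stored
        return raw.decode("utf-8")     # strict decode: raises on invalid UTF-8
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


original = "約翰的測試文件🖖.msg"
# Simulate the reported bug: UTF-8 bytes interpreted as cp1252.
garbled = original.encode("utf-8").decode("cp1252")

assert repair_mojibake(garbled) == original
# A genuine cp1252 name is not valid UTF-8, so it is left alone.
assert repair_mojibake("café.msg") == "café.msg"
```

Two caveats on this sketch: a short cp1252 name that happens to form valid UTF-8 would be rewritten incorrectly, and Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so a name containing those bytes cannot be round-tripped this way.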
> --
> Regarding the Ole10Native data stream, I can't find any official 
> documentation for its structure, but the three extra UTF-16 string 
> properties I'm seeing at the end appear to use the following format:
> - The strings are not null terminated, but instead are preceded by a 4-byte 
> string length value. Note that this value is the number of 16-bit code units 
> in the UTF-16 string and not the byte length.
> - The order of the 3 strings is temporary path, filename, original path. This 
> differs from the order of the normal ANSI / UTF-8 strings near the beginning 
> of the Ole10Native stream which is filename, original path, temporary path.
> - I'm assuming these wide variants of these strings are optional and may not 
> be present.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
