Ross Johnson created TIKA-3388:
----------------------------------

             Summary: Ole10Native attachments with non-ASCII filenames 
extracted with garbled names
                 Key: TIKA-3388
                 URL: https://issues.apache.org/jira/browse/TIKA-3388
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.26
            Reporter: Ross Johnson
         Attachments: Ole10Native att with Unicode name.docx

I've encountered some Word files containing Ole10Native embedded objects that 
Tika extracts with garbled filenames. It looks like the attachments were 
originally named with Chinese and other Unicode characters, and the filename 
Tika returns is a cp1252 interpretation of the original UTF-8-encoded filename.

Looking closer at the Ole10Native stream of these files, it does seem like 
there is a UTF-8 version of the filename stored, as well as a UTF-16 version of 
the filename stored later on after the actual attachment data. I believe POI is 
returning this first UTF-8 version of the filename interpreted as if it were 
ANSI / cp1252.
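
To illustrate the suspected mechanism, here is a minimal Python sketch (not 
Tika or POI code). It assumes the stream really holds the filename as UTF-8 
bytes and that the reader decodes those bytes as cp1252. The filename is the 
one from the sample file attached to this issue.

```python
# Filename from the attached sample, assumed to be stored as UTF-8 bytes.
original = "約翰的測試文件🖖.msg"
stored_bytes = original.encode("utf-8")

# Decoding the same bytes as cp1252 turns every byte into one character,
# which produces the garbled name.
garbled = stored_bytes.decode("cp1252")

print(garbled)
print(len(original), "characters became", len(garbled))
```

Each Chinese character occupies three bytes in UTF-8 and the emoji occupies 
four, so the garbled name is noticeably longer than the original.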

A possible solution would be for Apache POI to read and return the provided 
UTF-16 filename if it is present. Alternatively, Tika could check the currently 
returned "ANSI" name to see if it might actually be valid UTF-8.
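
The second option could look roughly like the following Python sketch. The 
function name repair_ansi_name is hypothetical and this is not existing Tika 
code. It assumes the name was decoded as cp1252: it re-encodes the name to 
recover the raw bytes, then accepts the UTF-8 reading only if those bytes 
decode strictly as UTF-8.

```python
def repair_ansi_name(name: str) -> str:
    """Return the UTF-8 reading of `name` if its cp1252 bytes are valid UTF-8.

    Hypothetical helper for illustration; otherwise `name` is returned as-is.
    """
    try:
        # Recover the raw bytes, assuming the name was decoded as cp1252.
        raw = name.encode("cp1252")
    except UnicodeEncodeError:
        # The name contains characters cp1252 cannot represent, so it was
        # not produced by a cp1252 decode; leave it alone.
        return name
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so this is most likely a genuine ANSI name.
        return name


garbled = "約翰.msg".encode("utf-8").decode("cp1252")
print(repair_ansi_name(garbled))       # repaired to the UTF-8 reading
print(repair_ansi_name("café.msg"))    # genuine cp1252 name, unchanged
```

Pure ASCII names pass through unchanged because ASCII is identical in both 
encodings. A genuine cp1252 name whose bytes happen to form valid UTF-8 would 
be misread, so this is a heuristic rather than a guarantee.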

Attached is a sample file I made which has a .msg file with name 
"約翰的測試文件🖖.msg" embedded in a .docx file. Tika currently extracts the attachment 
with filename "ç´„ç¿°çš„æ¸¬è©¦æ–‡ä»¶ðŸ––.msg".

--

Regarding the Ole10Native data stream, I can't find any official documentation 
for its structure, but the three extra UTF-16 string properties I'm seeing at 
the end appear to use the following format:

- The strings are not null terminated, but instead are preceded by a 4-byte 
string length value. Note that this value is the number of 16-bit code units in 
the UTF-16 string and not the byte length.
- The order of the 3 strings is temporary path, filename, original path. This 
differs from the order of the normal ANSI / UTF-8 strings near the beginning of 
the Ole10Native stream which is filename, original path, temporary path.
- I'm assuming these wide variants of these strings are optional and may not be 
present.
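
The layout described in the list above can be sketched in Python as follows. 
This is an illustration, not POI code. The names read_wide_strings and 
pack_wide are hypothetical, and little-endian byte order for both the length 
field and the UTF-16 data is an assumption that the observations above do not 
confirm. The caller is assumed to already know the offset at which the 
attachment data ends.

```python
import struct
from typing import Optional, Tuple


def read_wide_strings(buf: bytes, offset: int) -> Optional[Tuple[str, str, str]]:
    """Read (temp_path, filename, original_path) starting at `offset`.

    Returns None if the optional wide strings are absent or truncated.
    """
    values = []
    for _ in range(3):
        if offset + 4 > len(buf):
            return None
        # 4-byte length counted in 16-bit code units (little-endian assumed).
        (units,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        end = offset + units * 2  # two bytes per UTF-16 code unit
        if end > len(buf):
            return None
        try:
            values.append(buf[offset:end].decode("utf-16-le"))
        except UnicodeDecodeError:
            return None
        offset = end
    # Order observed in the stream: temporary path, filename, original path.
    temp_path, filename, original_path = values
    return temp_path, filename, original_path


def pack_wide(text: str) -> bytes:
    """Build one length-prefixed UTF-16 string for the demo below."""
    data = text.encode("utf-16-le")
    return struct.pack("<I", len(data) // 2) + data


# Made-up trailer for demonstration; the paths are placeholders.
name = "約翰的測試文件🖖.msg"
trailer = pack_wide("C:\\Temp\\" + name) + pack_wide(name) + pack_wide("C:\\Docs\\" + name)
print(read_wide_strings(trailer, 0))
print(read_wide_strings(b"", 0))  # no wide strings present
```

The emoji in the sample name takes two code units, so the length prefix for 
the filename is 13 even though the name has 12 characters. This is why the 
prefix has to be treated as a code-unit count rather than a character count 
or a byte length.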



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
