Ryan Liu created TIKA-3374:
------------------------------
Summary: Non-Unicode archive entry name is garbled
Key: TIKA-3374
URL: https://issues.apache.org/jira/browse/TIKA-3374
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.26
Environment: The attachment is an example of a zip file that uses a
Non-Unicode archive entry name.
The filename in the zip file should be
*集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
but is garbled in Tika 1.26 because the PackageParser treats it as Unicode.
Reporter: Ryan Liu
Attachments: gbk.zip
PackageParser retrieves the archive entry name through the commons-compress
ArchiveEntry#getName method and does not perform automatic charset detection
on entry names.
Although one could set the encoding by passing an ArchiveStreamFactory(charset)
into the parser context, this is not practical, since an archive file may use
any of a wide range of charsets.
Instead of calling entry.getName() directly in PackageParser#parseEntry(),
I recommend using entry.getRawName() and applying charset detection to reduce
the chance of producing a garbled string.
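
A minimal sketch of the idea, using only the JDK: decode the raw name bytes
strictly as UTF-8 and, if they are not valid UTF-8, decode them with a fallback
charset. The class name EntryNameDecoder and the method decode are hypothetical
and are not part of Tika or commons-compress. The fixed fallback charset is only
a stand-in for real charset detection, and the raw bytes are assumed to be what
entry.getRawName() returns for a zip entry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or commons-compress.
final class EntryNameDecoder {
    private EntryNameDecoder() {}

    // Decodes raw entry-name bytes: strict UTF-8 first, then the fallback charset.
    // Assumption: a real implementation would pick the fallback with a charset
    // detector instead of receiving it from the caller.
    static String decode(byte[] rawName, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes malformed input throw instead of being replaced
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so interpret the bytes in the fallback charset.
            return new String(rawName, fallback);
        }
    }

    public static void main(String[] args) {
        // Assumes the runtime ships the GBK charset (standard JDK builds do).
        Charset gbk = Charset.forName("GBK");
        // Simulate the raw bytes of a GBK-encoded entry name.
        byte[] raw = "\u6587\u6863.doc".getBytes(gbk);
        System.out.println(decode(raw, gbk));
    }
}
```

Trying strict UTF-8 first keeps Unicode and plain ASCII names unchanged, so only
names that fail UTF-8 validation reach the fallback. A legacy-encoded name whose
bytes happen to form valid UTF-8 would still be misread, which is why a real
detector would be preferable.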
--
This message was sent by Atlassian Jira
(v8.3.4#803005)