Edit report at https://bugs.php.net/bug.php?id=65815&edit=1
ID: 65815
Comment by: matti dot jarvinen at nitroid dot fi
Reported by: matti dot jarvinen at nitroid dot fi
Summary: ZipArchive reads filenames with UTF-8 characters wrong
Status: Open
Type: Bug
Package: Zip Related
Operating System: Fedora 3.8.6-203.fc18.x86_64
PHP Version: 5.4.20
Block user comment: N
Private report: N
New Comment:
If a zip file contains the following files:
test3/12-päivä.pdf
test3/ä中华人民共和国.PDF
test3/Российская Федерация.PDF
test3/中华人民共和国.PDF
ZipArchive will read them as:
test3/12-p�iv�.pdf
test3/ä中华人民共和国.PDF
test3/Российская Федерация.PDF
test3/中华人民共和国.PDF
Broken file names can be converted to correct UTF-8 with:
<?php
// Valid UTF-8 survives a round trip through UTF-32 unchanged.
$roundTrip = mb_convert_encoding(
    mb_convert_encoding($filename, 'UTF-32', 'UTF-8'),
    'UTF-8',
    'UTF-32'
);
if ($filename === $roundTrip)
{
    $fixedFilename = $filename;
}
else
{
    // Otherwise treat the name as CP850 and convert it to UTF-8.
    $fixedFilename = mb_convert_encoding($filename, 'UTF-8', 'CP850');
}
?>
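The same check can be wrapped in a small helper. This is only a sketch: fixZipEntryName() is a hypothetical name, mb_check_encoding() stands in for the UTF-32 round trip above, and CP850 is an assumed fallback that has only been tried against the file names in this report.

```php
<?php
/**
 * Return $name as UTF-8 (hypothetical helper, not part of ZipArchive).
 * Assumption: a name that is not valid UTF-8 was stored in CP850.
 */
function fixZipEntryName($name)
{
    // Names that are already valid UTF-8 are returned untouched.
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }
    // Anything else is treated as CP850 and converted to UTF-8.
    return mb_convert_encoding($name, 'UTF-8', 'CP850');
}

// Usage: byte 0x84 is the one the original test script replaces with 'ä'.
print fixZipEntryName("test3/12-p\x84iv\x84.pdf") . "\n";
```

Pure ASCII names also pass the UTF-8 check, so they are never converted.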
Appendix D, "Language Encoding (EFS)", of the .ZIP File Format Specification,
version 6.3.3, might explain how to read the file name encoding correctly from
the zip file:
http://www.pkware.com/documents/casestudies/APPNOTE.TXT
If I understood the spec correctly, the code page should be CP437 when the name
is not UTF-8, although mb_convert_encoding does not support that encoding. I got
good results with CP850, but I have not been able to verify the workaround
against every character in CP850 and CP437.
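As far as I can tell from Appendix D, bit 11 of an entry's general purpose bit flag (the language encoding flag, EFS) marks the file name as UTF-8, and without it the name is expected to be in IBM code page 437. A rough sketch of that rule follows. decodeZipEntryName() and its $generalPurposeFlags parameter are hypothetical: reading the flag word out of the archive is not shown here, and the iconv() call assumes the local iconv build knows CP437.

```php
<?php
// Bit 11 of the general purpose bit flag: language encoding flag (EFS).
define('ZIP_FLAG_EFS', 0x0800);

/**
 * Decode a raw zip entry name to UTF-8 (hypothetical helper).
 * $generalPurposeFlags is the 16-bit flag word from the entry's header.
 */
function decodeZipEntryName($rawName, $generalPurposeFlags)
{
    if ($generalPurposeFlags & ZIP_FLAG_EFS) {
        // The archiver declared the name to be UTF-8 already.
        return $rawName;
    }
    // Assumption: iconv supports CP437 on this system.
    // iconv() returns false when the conversion fails.
    return iconv('CP437', 'UTF-8', $rawName);
}
```

Unlike the CP850 workaround, this relies on what the archive declares instead of guessing from the bytes, so it only helps when the archiver set the flag correctly.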
Previous Comments:
------------------------------------------------------------------------
[2013-10-02 15:51:05] matti dot jarvinen at nitroid dot fi
Description:
------------
I have a valid zip file, created with Windows 8 and with iZarc, containing
file names like 12-päivä.pdf and 13-päivä.pdf.
ZipArchive reads these file names incorrectly.
At least getNameIndex and extractTo are affected.
Test script:
---------------
<?php
mb_internal_encoding('UTF-8');
ini_set('default_charset', 'UTF-8');
$Zip = new ZipArchive();
$open = $Zip->open('test.zip');
if ($open !== true)
{
    die("Cannot open test.zip (error code $open)\n");
}
$length = $Zip->numFiles;
for ($i = 0; $i < $length; $i++)
{
    $brokenImportName = $Zip->getNameIndex($i);
    print $brokenImportName;
    die(); // stop after the first entry
    // This is a specific workaround: byte 0x84 (chr(132)) is apparently
    // not stored as UTF-8.
    //$fixedImportName = str_replace(chr(132),'ä',$brokenImportName);
    //print $fixedImportName;
}
?>
Expected result:
----------------
12-päivä.pdf
Actual result:
--------------
12-p�iv�.pdf
------------------------------------------------------------------------