Edit report at https://bugs.php.net/bug.php?id=47096&edit=1
ID: 47096
Comment by: nicolas dot grekas+php at gmail dot com
Reported by: nuabaranda at web dot de
Summary: move_uploaded_file not OS encoding aware
Status: Open
Type: Bug
Package: Filesystem function related
Operating System: win32 only - Windows XP
PHP Version: 5.2.8
Block user comment: N
Private report: N
New Comment:
Well, if you really need it, there may be one possibility using a COM object:
$fs = new \COM('Scripting.FileSystemObject', null, CP_UTF8);
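A minimal sketch of how that object might stand in for move_uploaded_file(),
assuming the com_dotnet extension is loaded on Windows and that the object's
MoveFile method accepts UTF-8 paths when the object is created with CP_UTF8.
The helper name moveUploadViaCom() is hypothetical, and this has not been
verified against every Windows version:

```php
<?php
// Hypothetical helper: move an uploaded file to a UTF-8 encoded destination
// path through Scripting.FileSystemObject. Returns false when COM is not
// available, when the source is not an uploaded file, or when the move fails.
function moveUploadViaCom(string $tmpPath, string $utf8Dest): bool
{
    // Keep the safety check that move_uploaded_file() would have done.
    if (!class_exists('COM') || !is_uploaded_file($tmpPath)) {
        return false;
    }
    try {
        $fs = new \COM('Scripting.FileSystemObject', null, CP_UTF8);
        $fs->MoveFile($tmpPath, $utf8Dest);
        return true;
    } catch (\Throwable $e) {
        // COM errors surface as exceptions; treat them as a failed move.
        return false;
    }
}
```

On non-Windows hosts the function simply returns false, so a caller could
fall back to move_uploaded_file() there.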
Previous Comments:
------------------------------------------------------------------------
[2012-04-03 15:12:07] salsi at icosaedro dot it
Just to complete my little survey of the file name encoding issue:
1. Under Windows Vista, in the "Regional and Language Settings" control panel,
the "Formats" panel must also be set to match the language selected in the
"Advanced" panel in order to set the LC_CTYPE property; the "Advanced" panel
only selects the translation mapping between Unicode and the multi-byte
encoding and does not set the locale properties.
For example, in a Western country LC_CTYPE="english_United States.1252" while
in Japan it might be LC_CTYPE="Japanese_Japan.1252".
2. Windows applies the "best fit" conversion table
(http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/) when
translating Unicode file names to multi-byte file names
(http://msdn.microsoft.com/en-us/library/windows/desktop/dd374047%28v=vs.85%29.aspx);
characters that have no best fit are replaced by a question mark "?".
So, for example, when the Japanese locale is set (code page 932) the Latin
capital letter A with dieresis ("Ä") might map to the plain capital letter "A"
and accented vowels like "àèìòù" might be translated to the plain ASCII
letters "aeiou".
This means that, from inside PHP, file names retrieved from the file system via
dir() or getcwd() are only APPROXIMATIONS of the real path, and there is no way
to detect whether they really match the actual name.
Conclusions
===========
Under Unix and Linux with a properly set locale, a PHP program can access and
retrieve any file name that matches the current locale; UTF-8 is the best
choice here.
Under Windows, PHP programs can create and access any file or file path that
contains only characters included in the current code page table; however,
PHP programs cannot rely on file names retrieved from the file system, because
these might be arbitrarily mangled and there is no way to detect such mangling.
------------------------------------------------------------------------
[2012-03-17 18:19:24] salsi at icosaedro dot it
As PHP operates under Windows as a "non-Unicode aware program", file names are
bare arrays of bytes, represented in PHP as "string"; these strings are
converted back and forth to Unicode by Windows according to the currently
selected "code page table" (see "Control Panel", "Regional and Language
Options", "Administrative" tab panel, "Language for non-Unicode programs").
Unfortunately, UTF-8 encoding is not available there, so whatever locale you
choose, some Unicode file names may still remain inaccessible to PHP.
For example, if your system locale is any Western European encoding (code page
1252), there is no way to refer to a file whose name is "日本語"; only on a
Windows system with the Japanese locale set (code page 932) can you access such
a name, provided that the "string" that represents that name is properly
encoded as required by code page 932, that is "\x93\xfa\x96\x7b\x8c\xea".
So, if you have a generic file name (along with its path) as a Unicode
string $u (for example UTF-8 encoded) and you want to try to save a file with
that name under Windows, you must first check the current locale by calling
setlocale(LC_CTYPE, 0) to retrieve the current code page, then convert $u to
an array of bytes according to that code page; if one or more code points
have no counterpart in the current code page, the file cannot be saved with
that name from PHP. Period.
To complicate the implementation of such an algorithm, neither mbstring nor
iconv is aware of all the Windows code pages, so you must write these
conversion routines yourself. This is just what I have done experimentally
in PHP, and it appears to work nicely
(http://www.icosaedro.it/phplint/libraries.cgi?lib=stdlib/it/icosaedro/io/FileName.html).
Hopefully some day something similar will be available in the PHP core library,
or some other abstraction layer of classes will provide full access to the
Unicode realm.
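The procedure described above could be sketched roughly as follows. This is
only an illustration, not the FileName library linked above: the function
names codePageFromLocale() and encodeForCodePage() are hypothetical, and the
sketch assumes that iconv happens to know the code page under the name
"CP<number>", which, as noted, is not true for every Windows code page.

```php
<?php
// Extract the numeric code page from a Windows-style locale string such as
// "English_United States.1252"; returns null if the string has no such suffix.
function codePageFromLocale(string $locale): ?int
{
    return preg_match('/\.(\d+)$/', $locale, $m) ? (int) $m[1] : null;
}

// Convert a UTF-8 name to the bytes of the given code page. Returns null when
// iconv rejects the conversion or when converting back does not reproduce the
// original name, i.e. when the name cannot be represented exactly.
function encodeForCodePage(string $utf8Name, int $codePage): ?string
{
    $target = 'CP' . $codePage; // assumption: iconv supports this code page
    $bytes = @iconv('UTF-8', $target, $utf8Name);
    if ($bytes === false) {
        return null;
    }
    $back = @iconv($target, 'UTF-8', $bytes);
    return $back === $utf8Name ? $bytes : null;
}

// Usage: "0" asks setlocale() for the current setting without changing it.
$codePage = codePageFromLocale((string) setlocale(LC_CTYPE, '0'));
if ($codePage !== null) {
    $bytes = encodeForCodePage("r\xC3\xA9sum\xC3\xA9.txt", $codePage);
    // $bytes is null if the name cannot be saved under this code page.
}
```

The round-trip comparison is a design choice: it rejects names that a
particular iconv implementation would silently approximate instead of
refusing outright.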
References:
http://en.wikipedia.org/wiki/Windows_code_page
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
------------------------------------------------------------------------
[2011-09-23 03:02:09] xd-yang at qq dot com
Since basename() is locale aware, why isn't move_uploaded_file()?
A common workaround is to use iconv() to explicitly convert the destination
filename, usually from UTF-8 to an ANSI encoding (like GB2312). But this
becomes complicated and impractical in a multilingual CMS like WordPress. Can
this issue be solved in the future?
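For reference, the workaround mentioned above might look roughly like this.
The helper name uploadTargetName(), the form field name "f" and the $dir
variable are hypothetical, and the target encoding has to be known in
advance, which is exactly what makes this awkward for a multilingual site:

```php
<?php
// Hypothetical helper: convert a client-supplied UTF-8 file name to the byte
// encoding the server's file system API expects. Returns null if iconv
// cannot perform the conversion.
function uploadTargetName(string $utf8Name, string $fsEncoding): ?string
{
    $converted = @iconv('UTF-8', $fsEncoding, $utf8Name);
    return $converted === false ? null : $converted;
}

// Possible use in an upload handler (field name "f" and $dir are assumed):
// $name = uploadTargetName(basename($_FILES['f']['name']), 'GB2312');
// if ($name !== null) {
//     move_uploaded_file($_FILES['f']['tmp_name'], $dir . '/' . $name);
// }
```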
------------------------------------------------------------------------
[2009-02-26 09:46:51] mm107137 at spamcorptastic dot com
I have the same problem on a Debian host (hosted by OVH).
File names with French accents passed to move_uploaded_file() are mangled.
There is no problem if the file name is not passed as UTF-8.
Very annoying.
------------------------------------------------------------------------
[2009-02-06 20:21:49] mindfreakthemon at gmail dot com
The bug also exists on Windows 7 and Vista under Apache 2.2.
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=47096