Here's an interesting note from p5p about string encodings, Unicode, and locales with regard to filenames.
-- c

---------- Forwarded Message ----------

Subject: Re: [PATCH] Add note to perltodo.pod about Unicode and file globbing
Date: Friday 29 December 2006 11:50
From: Jan Dubois <[EMAIL PROTECTED]>
To: Juerd <[EMAIL PROTECTED]>
Cc: [email protected]

On Fri, 29 Dec 2006 17:56:54 +0100, Juerd <[EMAIL PROTECTED]> wrote:

>Jan Dubois skribis 2006-12-28 19:06 (-0800):
>> +Currently glob patterns and filenames returned from File::Glob::glob()
>> +are always byte strings.
>
>Aren't file names byte strings by definition, in all system calls too?

On most Unix systems this is true because the file systems store names
as byte sequences while ignoring the encoding. This has the well-known
problem that changing the locale setting for your application can change
the apparent filenames as well.

Other systems (Windows, OS/2, OS X) store filenames in a specific Unicode
encoding and the system calls will encode/decode "byte string" filenames
at the API level. This has the well-known problem that some files may be
inaccessible through the "byte string" APIs because their names cannot be
encoded without using replacement characters.

On Windows you have 2 different APIs for most system calls, an "ANSI"
version that uses the current codepage to translate names to Unicode,
and a "Wide" version that takes WCHAR arguments directly. E.g. you have
CreateFileA() and CreateFileW(). The "ANSI" version obviously can only
deal with filenames that can be represented in the current codepage
("ANSI" does *not* mean Latin1, but is really just a locale setting).

>Can Perl know the encoding of a certain file name, in any reliable and
>portable way?

Perl knows if a string is a byte string or UTF-8 encoded, so it _could_
do the right thing internally, but all the internal APIs in Perl are
taking char* arguments and not SV*s, so the SVf_UTF8 flag is lost in the
lower layers. :(

Changing all the internal APIs would be a lot of work and create quite a
few backwards compatibility problems, so I doubt this will be done for
Perl 5. I just hope that the Perl 6 implementation(s) will get this right.

For now I'm trying to provide a workaround for many of the Unicode
filename problems on Windows by using the "short filenames" provided by
NTFS whenever the real Unicode filename cannot be represented in a byte
string. This will not be as nice as full support for Unicode filenames,
but at least it will make it possible to work with these files from Perl
at all.

Cheers,
-Jan

-------------------------------------------------------
