Here's an interesting note from p5p about string encodings, Unicode, and 
locales with regard to filenames.

-- c

----------  Forwarded Message  ----------

Subject: Re: [PATCH] Add note to perltodo.pod about Unicode and file globbing
Date: Friday 29 December 2006 11:50
From: Jan Dubois <[EMAIL PROTECTED]>
To: Juerd <[EMAIL PROTECTED]>
Cc: [email protected]

On Fri, 29 Dec 2006 17:56:54 +0100, Juerd <[EMAIL PROTECTED]> wrote:
>Jan Dubois skribis 2006-12-28 19:06 (-0800):
>> +Currently glob patterns and filenames returned from File::Glob::glob()
>> +are always byte strings.
>
>Aren't file names byte strings by definition, in all system calls too?

On most Unix systems this is true because the file systems store names
as byte sequences and ignore the encoding.  This has the well-known
problem that changing the locale setting for your application can change
how those filenames are interpreted and displayed.
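
A minimal sketch of that effect, in Python for illustration only (the
filename and the two encodings are assumptions, not taken from the
thread): the bytes stored on disk never change, but the name a program
sees depends on which encoding the current locale implies.

```python
# Bytes as a Unix file system might store them (assumed example name).
raw_name = "café.txt".encode("utf-8")

# The same bytes, decoded under two different locale encodings.
as_utf8 = raw_name.decode("utf-8")      # what a UTF-8 locale shows
as_latin1 = raw_name.decode("latin-1")  # what a Latin-1 locale shows

print(as_utf8)
print(as_latin1)
print(as_utf8 == as_latin1)  # the "name" differs, the bytes do not
```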

Other systems (Windows, OS/2, OS X) store filenames in a specific
Unicode encoding and the system calls will encode/decode "byte string"
filenames at the API level.  This has the well known problem that some
files may be inaccessible through the "byte string" APIs because their
names cannot be encoded without using replacement characters.

On Windows you have 2 different APIs for most system calls, an "ANSI"
version that uses the current codepage to translate names to Unicode,
and a "Wide" version that takes WCHAR arguments directly.

E.g. you have CreateFileA() and CreateFileW().  The "ANSI" version
obviously can only deal with filenames that can be represented in the
current codepage ("ANSI" does *not* mean Latin1, but is really just a
locale setting).
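
To illustrate why the "ANSI" route loses files, here is a rough sketch in
Python.  It uses Python's own cp1252 codec as a stand-in for the Windows
codepage conversion, which is an assumption: the real conversion has its
own best-fit rules.  The helper name and the sample filenames are
hypothetical.

```python
def representable(name, codepage="cp1252"):
    """Return True if `name` survives encoding into `codepage` unchanged."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

# Two distinct Unicode filenames (assumed examples).
first = "日本語.txt"
second = "中文名.txt"

# With replacement characters, both collapse to the same byte string,
# so a byte-string API has no way to tell them apart.
lossy_first = first.encode("cp1252", errors="replace")
lossy_second = second.encode("cp1252", errors="replace")

print(representable("report.txt"), representable(first))
print(lossy_first, lossy_second)
```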

>Can Perl know the encoding of a certain file name, in any reliable and
>portable way?

Perl knows whether a string is a byte string or UTF-8 encoded, so it
_could_ do the right thing internally, but all the internal APIs in Perl
take char* arguments and not SV*s, so the SVf_UTF8 flag is lost in the
lower layers. :(

Changing all the internal APIs would be a lot of work and create quite a
few backwards compatibility problems, so I doubt this will be done for
Perl 5.  I just hope that the Perl 6 implementation(s) will get this
right.

For now I'm trying to provide a workaround for many of the Unicode
filename problems on Windows by using the "short filenames" provided by
NTFS whenever the real Unicode filename cannot be represented in a byte
string.  This will not be as nice as full support for Unicode filenames,
but at least it will make it possible to work with these files from Perl
at all.
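
The selection logic of such a workaround might look roughly like the
Python sketch below.  This is not the actual Perl patch: the function
name, the cp1252 default, and the sample short name are all
hypothetical, and the short name is passed in rather than looked up (on
Windows it would come from an API such as GetShortPathNameW, which is
not called here).

```python
def byte_safe_name(long_name, short_name, codepage="cp1252"):
    """Prefer the real name; fall back to the short name when the real
    name cannot be encoded in the given codepage."""
    try:
        long_name.encode(codepage)
        return long_name
    except UnicodeEncodeError:
        return short_name

# "ABCD~1.TXT" is a made-up short name used only for illustration.
print(byte_safe_name("report.txt", "REPORT~1.TXT"))
print(byte_safe_name("日本語.txt", "ABCD~1.TXT"))
```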

Cheers,
-Jan

-------------------------------------------------------
