Re: Converting between UTF8 and local codepage without specifying local codepage

Nicholas Clark Wed, 09 Nov 2005 02:49:41 -0800

On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
> And yes, figuring out the local code page on unix is particularly 
> squirrelly.  The codepage for "fr_CA.ISOxxx" is pretty easy but what about 
> "fr_CA" and "fr" ? There are a lot of aliases and rules involved so that 
> the locale is just about useless (in one case you can tell it is shift-JIS 
> because the "j" in the locale is capitalized (I wish I was kidding!)). 
> 
> As a number of others have suggested to me it seems like something basic 
> that Perl should absolutely know someplace internally. But I have yet to 
> find an API to get it. 
> If there was some way to do decode/encode without having to know the local 
> codepage that would make me happy too. I just want to get encode/decode to 
> work.


No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8 bit cleanly but assuming
US-ASCII for 8 bit data. If you C<use locale> then system locales are used
for case related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables LC_CTYPE
and LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any C
program.
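
To make that concrete, here is a minimal C sketch of the same mechanism. It
is not Perl's actual source; the function names adopt_env_locale() and
upcase_in_place() are made up for illustration. Only setlocale() and
toupper() are the real C library calls.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: adopt the collation and character-type locales
 * from the environment. An empty string tells setlocale() to consult
 * LC_ALL, then the category's own variable, then LANG.
 * Returns the opaque LC_CTYPE locale name, or NULL on failure. */
const char *adopt_env_locale(void)
{
    if (setlocale(LC_COLLATE, "") == NULL)
        return NULL;
    /* Called last, so the returned string is not overwritten by a
     * later setlocale() call before the caller sees it. */
    return setlocale(LC_CTYPE, "");
}

/* Hypothetical helper: upper-case a byte string in place using whatever
 * LC_CTYPE locale is currently in effect. */
void upcase_in_place(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)toupper((unsigned char)*s);
}
```

After adopt_env_locale(), upcase_in_place() follows the system's rules for
8 bit characters, but the program only ever sees bytes going in and bytes
coming out. The string setlocale() returns is an implementation-defined
locale name, with all the alias problems described above.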

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's ISO-8859-1.
Definitely buggy. Not possible to change without breaking backward
compatibility.
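
A rough C sketch of what that assumption amounts to, again not Perl's own
code, with latin1_to_utf8() being a name invented here: each 8 bit byte is
taken to be the Unicode code point of the same number, and is then written
out as UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: convert n bytes to UTF-8 on the assumption that
 * they are ISO-8859-1, i.e. byte value == Unicode code point.
 * out must have room for 2 * n bytes. Returns the bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* US-ASCII range: one byte, unchanged. */
            out[written++] = b;
        } else {
            /* U+0080..U+00FF: two-byte UTF-8 sequence. */
            out[written++] = (unsigned char)(0xC0 | (b >> 6));
            out[written++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return written;
}
```

If the bytes really were in some other code page the result is wrong. For
example byte 0xA4 is the euro sign in ISO-8859-15, but this conversion turns
it into U+00A4, the generic currency sign.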

Nicholas Clark
