On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
> And yes, figuring out the local code page on unix is particularly
> squirrelly. The codepage for "fr_CA.ISOxxx" is pretty easy but what about
> "fr_CA" and "fr" ? There are a lot of aliases and rules involved so that
> the locale is just about useless (in one case you can tell it is shift-JIS
> because the "j" in the locale is capitalized (I wish I was kidding!).
>
> As a number of others have suggested to me it seems like something basic
> that Perl should absolutely know someplace internally. But I have yet to
> find an API to get it.

> If there was some way to do decode/encode without having to know the local
> codepage that would make me happy to. I just want to get encode/decode to
> work.

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8-bit cleanly, but assuming
US-ASCII for 8-bit data.

If you C<use locale;> then system locales are used for case-related
operations and collation. This is done by calling the C function
setlocale() with the strings from the environment variables LC_CTYPE and
LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any
C program.

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8-bit data can be converted to Unicode by assuming that it's
ISO-8859-1. Definitely buggy. Not possible to change without breaking
backward compatibility.

Nicholas Clark
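
P.S. A minimal sketch of the ISO-8859-1 assumption described above, in case it helps to see it. The byte value 0xE9 and the comparison encoding cp1251 are picked purely for illustration; the only module used is Encode and its decode() function. Treat it as a demonstration under those assumptions, not as a recipe for finding the local code page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single 8-bit byte whose meaning depends on the code page.
my $byte = "\xE9";

# Concatenating with a character above U+00FF forces Perl to upgrade the
# whole string to its internal Unicode form. The byte is carried over as
# code point U+00E9, i.e. it is treated as if it were ISO-8859-1.
my $mixed    = $byte . "\x{263A}";
my $implicit = substr($mixed, 0, 1);

# Explicit decodes for comparison (encoding names chosen for illustration).
my $as_latin1 = decode('ISO-8859-1', $byte);
my $as_cp1251 = decode('cp1251', $byte);

printf "implicit upgrade: U+%04X\n", ord $implicit;
printf "ISO-8859-1:       U+%04X\n", ord $as_latin1;
printf "cp1251:           U+%04X\n", ord $as_cp1251;
```

The first two lines of output should agree, while the cp1251 decode gives a different character. If the data was not ISO-8859-1 to begin with, the implicit conversion gets it wrong without any warning, so the encoding has to be named explicitly in a call to decode().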