That is helpful information. I have been spending time trying to determine the local code page by other means, but have consistently been told that this is the wrong approach and that Perl must know it somehow. Getting a definitive answer is almost as helpful as getting a better answer.

Based on what you are saying, there is no way to ask Perl what the "local codepage" is and hence there can be no variant of "Encode" which can be told to convert from "local codepage" to UTF8 without having to provide the "local codepage" value explicitly.

Is langinfo(CODESET()) from I18N::Langinfo the best way to determine the local code page on Unix? Windows seems to reliably include the code page number in the locale name, but Unix is all over the map.
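
For reference, here is a minimal sketch of what I mean by asking the C library for the codeset and handing the answer to Encode explicitly. The helper names local_codeset and local_to_utf8 are made up for illustration. The sketch assumes a platform where I18N::Langinfo supports CODESET, and that the name it returns is one Encode recognises, which is not guaranteed.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Ask the C library which codeset the current LC_CTYPE locale uses.
# (hypothetical helper name)
sub local_codeset {
    setlocale(LC_CTYPE, '');     # adopt the locale named by the environment
    return langinfo(CODESET());  # e.g. "UTF-8" or "ISO-8859-1"
}

# Convert bytes in the given codeset to UTF-8 bytes.
# (hypothetical helper name)
sub local_to_utf8 {
    my ($bytes, $codeset) = @_;
    my $enc = find_encoding($codeset)
        or die "Encode does not recognise codeset '$codeset'\n";
    return encode('UTF-8', $enc->decode($bytes));
}

print "codeset reported by the C library: ", local_codeset(), "\n";
```

The codeset still has to be passed to Encode by name; the only difference is that the name comes from the C library rather than from parsing the locale string by hand.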

I greatly appreciate your responses.



Nicholas Clark <[EMAIL PROTECTED]>
Sent by: Nicholas Clark <[EMAIL PROTECTED]>

11/09/2005 05:49 AM

To
David Schlegel/Lexington/[EMAIL PROTECTED]
cc
David Graff <[EMAIL PROTECTED]>, perl-unicode@perl.org
Subject
Re: Converting between UTF8 and local codepage without specifying local codepage





On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
> And yes, figuring out the local code page on unix is particularly
> squirrelly.  The codepage for "fr_CA.ISOxxx" is pretty easy but what about
> "fr_CA" and "fr" ? There are a lot of aliases and rules involved so that
> the locale is just about useless (in one case you can tell it is shift-JIS
> because the "j" in the locale is capitalized; I wish I was kidding!).
>
> As a number of others have suggested to me it seems like something basic
> that Perl should absolutely know someplace internally. But I have yet to
> find an API to get it.
> If there was some way to do decode/encode without having to know the local
> codepage that would make me happy too. I just want to get encode/decode to
> work.

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8 bit cleanly but assuming
US-ASCII for 8 bit data. If you say "use locale;" then system locales are used
for case-related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables LC_CTYPE
and LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any C
program.
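
As a rough illustration of the point above, the following sketch installs the environment's LC_CTYPE locale and uppercases an 8 bit character under "use locale". The helper names install_ctype_locale and uc_with_locale are hypothetical, and what gets printed depends entirely on the locale the script is run in.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Install the LC_CTYPE locale named by the environment. The return value is
# only the locale's name (or undef on failure), not the name of a code page.
sub install_ctype_locale {
    return setlocale(LC_CTYPE, '');
}

# Uppercase a string with the locale's case rules in effect.
sub uc_with_locale {
    my ($text) = @_;
    use locale;    # lexically scoped: applies to the rest of this sub
    return uc $text;
}

my $name = install_ctype_locale();
print "LC_CTYPE locale: ", (defined $name ? $name : '(unavailable)'), "\n";

# Whether 0xE9 changes case depends on the locale's character tables.
printf "uc of byte 0xE9: 0x%02X\n", ord uc_with_locale("\xE9");
```

Nothing in this script ever sees a code page name: the case mapping happens inside the C library's tables.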

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's ISO-8859-1.
Definitely buggy. Not possible to change without breaking backward
compatibility.
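
A small sketch of that assumption in action, using only the core Encode module. The choice of byte 0xE9 and of cp437 as the "real" code page is arbitrary, picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xE9";    # one 8 bit byte; its meaning depends on the code page

# Concatenating with a character above 0xFF forces Perl to upgrade the byte
# string. The byte is carried over as if it were ISO-8859-1, i.e. U+00E9.
my $mixed = $byte . "\x{263A}";
printf "implicit upgrade: U+%04X\n", ord substr($mixed, 0, 1);

# Decoding explicitly as cp437 gives a different character for the same
# byte: 0xE9 is GREEK CAPITAL LETTER THETA (U+0398) in that code page.
my $decoded = decode('cp437', $byte);
printf "decoded as cp437: U+%04X\n", ord $decoded;
```

If the data really was cp437, only the explicit decode gives the right character, which is why the code page has to be supplied by the caller.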

Nicholas Clark
