Re: Converting between UTF8 and local codepage without specifying local codepage
[EMAIL PROTECTED] said:

> ... I have also come to the realization that Perl is not using the
> underlying system code pages but is relying on its own encoding objects to
> handle conversions. Since only a small set of encoding objects are
> available by default, this would mean that I would need to load up
> additional Perl CPAN modules to get additional language encodings;
> otherwise my code wouldn't be able to run much outside of ASCII and
> English environments. Windows seemed to work OK with Simplified Chinese
> using the Encode package, but maybe the Windows implementation does use
> the underlying system codepages somehow?

I'm sorry, I'm not sure I understand what you mean by "only a small set of
encoding objects". Regarding the Encode module and what it can handle in a
default installation, I see a total of 124 labels for supported encodings,
including:

- 11 distinct labels for various Unicode encodings
- 2 relating to ASCII
- all the iso-8859's (1-16)
- 38 different cp\d+
- 2 each of big5.* and gb\d+
- 3 each of euc-??, jis\d+ and koi8-.
- shiftjis and 7bit-jis
- a bunch of Mac codepages (some of which aren't really functional, but
  that's a separate topic)
- and more...

(As you probably know, the Encode man page tells how to get a complete list
of installed encodings. Presumably, some are synonyms for others.)

Are you referring to something other than codepages/encodings when you
mention "only a small set of encoding objects"? Or are you saying that 124
is only a small set?

> So am I correct that I would need to load up additional encodings, and I
> couldn't count on Perl to access the wide range of available system
> encodings otherwise? I just need to confirm that I am not misunderstanding
> something here.

If you could mention some specific items in the "wide range of available
system encodings" that do not show up within the Encode module's inventory,
that would help to clear things up.

Dave Graff
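A count like the 124 labels mentioned above can be reproduced with Encode's own inventory method. A minimal sketch; the exact count varies with the Perl and Encode versions installed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Encodings loaded into memory by default:
my @default = Encode->encodings;

# Every encoding this installation can supply (":all" pulls in the
# remaining Encode::* sub-modules on demand):
my @all = Encode->encodings(":all");

printf "%d loaded by default, %d available in total\n",
    scalar(@default), scalar(@all);

# A quick way to eyeball one of the groups listed above:
print join("\n", sort grep { /^iso-8859/ } @all), "\n";
```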
Re: Converting between UTF8 and local codepage without specifying local codepage
On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:

> And yes, figuring out the local code page on unix is particularly
> squirrelly. The codepage for fr_CA.ISOxxx is pretty easy, but what about
> fr_CA and fr? There are a lot of aliases and rules involved, so the locale
> is just about useless (in one case you can tell it is Shift-JIS because
> the "j" in the locale is capitalized; I wish I was kidding!). As a number
> of others have suggested to me, it seems like something basic that Perl
> should absolutely know someplace internally. But I have yet to find an API
> to get it. If there was some way to do decode/encode without having to
> know the local codepage, that would make me happy too. I just want to get
> encode/decode to work.

No, it's not something that Perl knows internally. By default, all case
conversion and similar operations are done 8-bit cleanly, but assuming
US-ASCII for 8-bit data. If you C<use locale;> then system locales are used
for case-related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables
LC_CTYPE and LC_COLLATE, which sets the behaviour of C functions such as
toupper() and tolower(). Hence Perl *still* has no idea what the local code
page is called, even when it's told to use it. The situation is the same
for any C program.

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8-bit data can be converted to Unicode by assuming that it's
ISO-8859-1. Definitely buggy. Not possible to change without breaking
backward compatibility.

Nicholas Clark
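A small sketch of the distinction described above: setlocale() switches the C library's behaviour, but the program only ever sees the locale *name*, not an encoding it could hand to Encode. The printed name is illustrative and depends entirely on the environment:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# This is roughly what "use locale;" arranges: adopt the locale named
# by the environment (LC_ALL / LC_CTYPE / LANG).
my $ctype = setlocale(LC_CTYPE, "");

# setlocale() returns the locale *name*, e.g. "fr_CA.ISO8859-1" or just
# "fr_CA".  Whether an encoding is embedded in that name, and how it is
# spelled, is platform-dependent; Perl learns nothing here that it could
# pass to Encode.
print "LC_CTYPE locale: ",
    (defined $ctype ? $ctype : "(setlocale failed)"), "\n";
```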
Re: Converting between UTF8 and local codepage without specifying local codepage
On Wed, Nov 09, 2005 at 10:02:31AM -0500, David Schlegel wrote:

> That is helpful information. I have been spending time to determine the
> local code page by other means, but have consistently been challenged that
> this is the wrong approach and that Perl must know somehow. Getting a
> definitive answer is almost as helpful as getting a better answer. Based
> on what you are saying, there is no way to ask Perl what the local
> codepage is, and hence there can be no variant of Encode which can be told
> to convert from local codepage to UTF8 without having to provide the local
> codepage value explicitly.

Yes. A good summary of the situation.

> Is I18N::Langinfo's langinfo(CODESET()) the best way to determine the
> local codepage for Unix? Windows seems to reliably include the codepage
> number in the locale, but Unix is all over the map.

I don't know. I have little to no experience of doing conversion of real
data, certainly for data outside of ISO-8859-1 and UTF-8, and I've never
used I18N::Langinfo. I hope that someone else on this list can give a
decent answer.

Nicholas Clark
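For what it's worth, the I18N::Langinfo route raised in the question looks like this. A sketch only: nl_langinfo() codeset names are platform-dependent, so the result is run through Encode's alias resolver before being trusted:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(resolve_alias);

# Adopt the user's locale first; langinfo(CODESET) reports the codeset
# of the *current* locale (in the default "C" locale it is typically
# "ANSI_X3.4-1968", i.e. ASCII).
setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);    # e.g. "UTF-8", "ISO-8859-15", ...

# The platform's spelling may not be Encode's canonical name, so ask
# Encode whether it recognizes it under some alias.
my $canonical = resolve_alias($codeset);
if ($canonical) {
    print "local codeset: $codeset (Encode name: $canonical)\n";
} else {
    warn "Encode does not recognize codeset '$codeset'\n";
}
```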
Re: Converting between UTF8 and local codepage without specifying local codepage
[EMAIL PROTECTED] said:

> Is there some way to convert from whatever the local codepage is to utf8
> and back again? The Encode::encode and decode routines require passing a
> specific codepage to do the conversion, but finding out what the local
> codepage is is very tricky across different platforms, particularly UNIX
> where it is hard to determine.

Have you looked at the perllocale man page? It's not clear to me that
figuring out the local codepage (i.e. the locale) is particularly hard on
unix systems -- that's what the POSIX locale protocol is for. (I don't know
how you would figure it out on MS-Windows systems, but that's more a matter
of me being blissfully ignorant of MS software generally.)

If you're dealing with data of unknown origin, and it's in some clearly
non-ASCII, non-Unicode encoding, then being able to detect its character
set is a speculative matter, especially for text in languages that use
single-byte encodings. The Encode::Guess module can help in detecting any
of the Unicode encodings and most of the multi-byte non-Unicode sets (i.e.
the legacy code pages for Chinese, Japanese and Korean), but it can't help
much when it comes to correctly detecting, say, ISO Cyrillic vs. ISO Greek
(vs. Thai vs. Arabic ...), let alone Latin1 vs. Latin2.

David Graff
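A sketch of the Encode::Guess usage described above, restricted to multi-byte Japanese sets where guessing is workable. The sample bytes are a hypothetical Shift_JIS string; guess_encoding() returns an encoding object on success and an error string on an ambiguous or failed guess:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical input: the two bytes 0x82 0xA0 are hiragana "a" in
# Shift_JIS, and are not valid EUC-JP, 7bit-JIS, or UTF-8.
my $bytes = "\x82\xa0";

# Keep the suspect list small and multi-byte; mixing in single-byte sets
# such as latin1 vs latin2 would make almost any input "ambiguous".
my $guess = guess_encoding($bytes, qw/euc-jp shiftjis 7bit-jis/);

if (ref $guess) {
    my $chars = $guess->decode($bytes);
    printf "guessed %s, %d character(s)\n", $guess->name, length $chars;
} else {
    warn "guess failed: $guess\n";   # $guess holds the error message
}
```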
Re: Converting between UTF8 and local codepage without specifying local codepage
Yes, I've re-read it after your suggestion, but the one area it completely
dances around is the local codepage. And from my use of Encode::decode and
encode, it is the one piece of information that it seems I am required to
know when converting local strings to UTF8. The data isn't of unknown
origin; it just came in from stdin or a local file that I know is in the
local codepage.

And yes, figuring out the local code page on unix is particularly
squirrelly. The codepage for fr_CA.ISOxxx is pretty easy, but what about
fr_CA and fr? There are a lot of aliases and rules involved, so the locale
is just about useless (in one case you can tell it is Shift-JIS because the
"j" in the locale is capitalized; I wish I was kidding!). As a number of
others have suggested to me, it seems like something basic that Perl should
absolutely know someplace internally. But I have yet to find an API to get
it. If there was some way to do decode/encode without having to know the
local codepage, that would make me happy too. I just want to get
encode/decode to work.

David Graff [EMAIL PROTECTED] wrote on 11/07/2005 08:20 PM:

> [EMAIL PROTECTED] said: Is there some way to convert from whatever the
> local codepage is to utf8 and back again? The Encode::encode and decode
> routines require passing a specific codepage to do the conversion, but
> finding out what the local codepage is is very tricky across different
> platforms, particularly UNIX where it is hard to determine.
>
> Have you looked at the perllocale man page? It's not clear to me that
> figuring out the local codepage (i.e. the locale) is particularly hard on
> unix systems -- that's what the POSIX locale protocol is for. (I don't
> know how you would figure it out on MS-Windows systems, but that's more a
> matter of me being blissfully ignorant of MS software generally.)
> If you're dealing with data of unknown origin, and it's in some clearly
> non-ASCII, non-Unicode encoding, then being able to detect its character
> set is a speculative matter, especially for text in languages that use
> single-byte encodings. The Encode::Guess module can help in detecting any
> of the Unicode encodings and most of the multi-byte non-Unicode sets (i.e.
> the legacy code pages for Chinese, Japanese and Korean), but it can't help
> much when it comes to correctly detecting, say, ISO Cyrillic vs. ISO Greek
> (vs. Thai vs. Arabic ...), let alone Latin1 vs. Latin2.
>
> David Graff
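Putting the thread's pieces together, here is one hedged sketch for the stdin case described in the reply above: ask the locale for its codeset, then let a PerlIO layer do the decoding instead of calling Encode::decode by hand. The codeset name may still need aliasing on some platforms, as discussed earlier in the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);

# Decode STDIN from the locale's codeset and encode STDOUT back to it,
# so the body of the program deals only in Perl character strings.
binmode STDIN,  ":encoding($codeset)";
binmode STDOUT, ":encoding($codeset)";

while (my $line = <STDIN>) {
    # $line is now a character string; re-encode explicitly if raw UTF-8
    # bytes are needed, e.g. Encode::encode("UTF-8", $line).
    print $line;
}
```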