Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-11 Thread David Graff

[EMAIL PROTECTED] said:
 ... I have also come to the realization that Perl is not using the
 underlying system code pages but is relying on its own encoding objects
 to handle conversions. Since only a small set of encoding objects are
 available by default this would mean that I would need to load up
 additional Perl CPAN modules to get additional language encodings,
 otherwise my code wouldn't be able to run much outside of ASCII and
 English environments. Windows seemed to work ok with Simplified Chinese
 using the Encode package but maybe the Windows implementation does use
 the underlying system codepages somehow ?  

I'm sorry, I'm not sure I understand what you mean by "only a small set of
encoding objects".  Regarding the Encode module and what it can handle in a
default installation, I see a total of 124 labels for supported
encodings, including:

 - 11 distinct labels for various Unicode encodings
 - 2 relating to ASCII
 - all the iso-8859's (1-16)
 - 38 different cp\d+
 - 2 each of big5.* and gb\d+
 - 3 each of euc-??, jis\d+ and koi8-.
 - shiftjis and 7bit-jis
 - a bunch of Mac codepages (some of which aren't really functional,
   but that's a separate topic)
 - and more...

(As you probably know, the Encode man page tells how to get a complete list
of installed encodings.  Presumably, some are synonyms for others.)
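
For what it's worth, a quick way to dump that full list on a given
installation is:

    use Encode;
    # ":all" asks Encode for every encoding it can resolve, including ones
    # provided by the Encode::* sub-modules that load on demand
    print "$_\n" for sort( Encode->encodings(":all") );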

Are you referring to something other than codepages/encodings when you
mention "only a small set of encoding objects"?  Or are you saying that
124 is only a small set?

 So am I correct that I would need to load up additional encodings and I
 couldn't count on Perl to access the wide range of available system
 encodings otherwise ? I just need to confirm that I am not
 misunderstanding something here.

If you could mention some specific items in the wide range of available
system encodings that do not show up within the Encode module's inventory,
that would help to clear things up.

Dave Graff




Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 Thread Nicholas Clark
On Tue, Nov 08, 2005 at 05:08:08PM -0500, David Schlegel wrote:
 And yes, figuring out the local code page on unix is particularly 
 squirrelly.  The codepage for fr_CA.ISOxxx is pretty easy but what about 
 fr_CA and fr ? There are a lot of aliases and rules involved so that 
 the locale is just about useless (in one case you can tell it is shift-JIS 
 because the j in the locale is capitalized; I wish I was kidding!). 
 
 As a number of others have suggested to me it seems like something basic 
 that Perl should absolutely know someplace internally. But I have yet to 
 find an API to get it. 
 If there was some way to do decode/encode without having to know the local 
 codepage, that would make me happy too. I just want to get encode/decode to 
 work. 

No, it's not something that Perl knows internally. By default all case
conversion and similar operations are done 8-bit cleanly, but assuming
US-ASCII for 8-bit data. If you "use locale;" then system locales are used
for case-related operations and collation. This is done by calling the C
function setlocale() with the strings from the environment variables LC_CTYPE
and LC_COLLATE, which sets the behaviour of C functions such as toupper() and
tolower(). Hence Perl *still* has no idea what the local code page is
called, even when it's told to use it. The situation is the same for any C
program.
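
A minimal illustration of that split, assuming a system that actually has a
Latin-1 French locale such as fr_FR.ISO8859-1 installed:

    use POSIX qw(setlocale LC_CTYPE);

    my $byte = "\xE9";                      # e-acute in Latin-1, just a byte to Perl
    print uc($byte), "\n";                  # unchanged: 0xE9 is not US-ASCII

    {
        use locale;                         # uc()/lc() now defer to the C library
        setlocale(LC_CTYPE, "fr_FR.ISO8859-1")
            or warn "locale not installed\n";   # assumed locale name, may differ
        print uc($byte), "\n";              # 0xC9 (E-acute) if the locale took hold
        # note that Perl still never learns the codepage's *name* here
    }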

Unicode tables are used for Unicode data, and there is a (buggy) assumption
that 8 bit data can be converted to Unicode by assuming that it's ISO-8859-1.
Definitely buggy. Not possible to change without breaking backward
compatibility.
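
A small demonstration of the assumption (the specific byte is just for
illustration):

    my $bytes = "\xE9";              # one unlabelled byte from some local codepage
    my $wide  = "\x{263A}";          # any character above 0xFF forces Unicode semantics
    my $mixed = $bytes . $wide;      # $bytes is upgraded as if it were ISO-8859-1
    printf "U+%04X\n", ord($mixed);  # prints U+00E9, whatever 0xE9 meant locally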

Nicholas Clark


Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-09 Thread Nicholas Clark
On Wed, Nov 09, 2005 at 10:02:31AM -0500, David Schlegel wrote:
 That is helpful information. I have been spending time trying to determine 
 the local codepage by other means but have consistently been challenged that this 
 is the wrong approach and that Perl must know somehow. Getting a 
 definitive answer is almost as helpful as getting a better answer. 
 
 Based on what you are saying, there is no way to ask Perl what the local 
 codepage is and hence there can be no variant of Encode which can be 
 told to convert from local codepage to UTF8 without having to provide 
 the local codepage value explicitly. 

Yes. A good summary of the situation.

 Is I18N::Langinfo's langinfo(CODESET()) the best way to determine the local 
 codepage for Unix? Windows seems to reliably include the codepage number in 
 the locale but Unix is all over the map.

I don't know. I have little to no experience of doing conversion of real
data, certainly for data outside of ISO-8859-1 and UTF-8, and I've never used
I18N::Langinfo. I hope that someone else on this list can give a decent
answer.
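
Untested here, but going by the I18N::Langinfo documentation the approach
would look roughly like this (it assumes a POSIX platform where
nl_langinfo(CODESET) is available and returns a name Encode recognises):

    # untested sketch, straight from the module documentation
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);
    use Encode qw(decode encode);

    setlocale(LC_CTYPE, "");                 # adopt the user's environment
    my $codeset = langinfo(CODESET());       # e.g. "UTF-8", "ISO-8859-1", "eucJP"

    my $local_bytes = "some octets read from a file or STDIN";
    my $text = decode($codeset, $local_bytes);   # local codepage -> perl's internal form
    my $utf8 = encode("UTF-8", $text);           # internal form -> UTF-8 octets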

Nicholas Clark



Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-08 Thread David Graff

[EMAIL PROTECTED] said:
 Is there some way to convert from whatever the local codepage is to utf8
 and back again?

 The Encode::encode and decode routines require passing a specific
 codepage to do the conversion, but finding out what the local codepage
 is is very tricky across different platforms, particularly UNIX, where it
 is hard to determine.

Have you looked at the perllocale man page?  It's not clear to me that
figuring out the local codepage (i.e. the locale) is particularly hard
on unix systems -- that's what the POSIX locale protocol is for.  (I 
don't know how you would figure it out on MS-Windows systems, but that's 
more a matter of me being blissfully ignorant of MS software generally.)
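
If it helps, the usual POSIX-side starting point is something like the
following rough sketch; it deliberately punts on locales with no explicit
codeset suffix (a bare "fr_CA" or "fr", or "C"/"POSIX"):

    use POSIX qw(setlocale LC_CTYPE);

    # ask the C library which LC_CTYPE locale the environment selects,
    # e.g. "fr_CA.ISO8859-1" or "ja_JP.eucJP"
    my $locale = setlocale(LC_CTYPE, "") || "C";

    # crude: take whatever follows the dot as the codeset name
    my ($codeset) = $locale =~ /\.([^@]+)/;
    $codeset ||= "ascii";            # no suffix, so no real information

    print "locale=$locale  codeset=$codeset\n";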

If you're dealing with data of unknown origin, and it's in some clearly 
non-ASCII, non-Unicode encoding, then being able to detect its character 
set is a speculative matter, especially for text in languages that use 
single-byte encodings.

The Encode::Guess module can help in detecting any of the unicode
encodings and most of the multi-byte non-unicode sets (i.e. the legacy code
pages for Chinese, Japanese and Korean), but it can't help much when it
comes to correctly detecting, say, ISO Cyrillic vs. ISO Greek (vs. Thai vs.
Arabic ...), let alone Latin1 vs. Latin2.
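
For the multi-byte cases, usage runs along these lines; the suspect list is
just an example, and ASCII plus the UTF flavours (with BOM) are checked by
default:

    use Encode::Guess qw(euc-jp shiftjis 7bit-jis);    # extra suspects to consider

    my $octets  = do { local $/; binmode STDIN; <STDIN> };  # raw bytes to examine
    my $decoder = Encode::Guess->guess($octets);
    if (ref $decoder) {
        printf "looks like %s\n", $decoder->name;
        my $text = $decoder->decode($octets);          # now in perl's internal form
    } else {
        warn "no confident guess: $decoder\n";         # on failure $decoder is a message
    }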

David Graff




Re: Converting between UTF8 and local codepage without specifying local codepage

2005-11-08 Thread David Schlegel

Yes, I've re-read it after your suggestion, but the one area it completely
dances around is the local codepage.  And from my use of Encode::decode and
encode, it is the one piece of information that it seems I am required to
know when converting local strings to UTF8.  The data isn't of unknown
origin; it just came in from stdin or a local file that I know is in the
local codepage.

And yes, figuring out the local code page on unix is particularly
squirrelly.  The codepage for fr_CA.ISOxxx is pretty easy, but what about
fr_CA and fr?  There are a lot of aliases and rules involved, so that the
locale is just about useless (in one case you can tell it is shift-JIS
because the j in the locale is capitalized; I wish I was kidding!).

As a number of others have suggested to me, it seems like something basic
that Perl should absolutely know someplace internally.  But I have yet to
find an API to get it.  If there was some way to do decode/encode without
having to know the local codepage, that would make me happy too.  I just
want to get encode/decode to work.
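
What I'm picturing is something along these lines, though I have no idea yet
how portable the langinfo part is:

    # untested sketch: assumes nl_langinfo(CODESET) exists on this platform
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_CTYPE, "");
    my $codeset = langinfo(CODESET());         # whatever the environment claims

    binmode STDIN,  ":encoding($codeset)";     # decode local bytes on read
    binmode STDOUT, ":encoding($codeset)";     # encode back on write

    while (my $line = <STDIN>) {               # $line is character data now
        print $line;                           # the rest of the program forgets codepages
    }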






