OK, I think I get 'use byte' vs. 'use utf8' vs. neither.  But, there
are still locales.  To illustrate the issues, let me get concrete here
for a minute.  (Please, no construction jokes.)

Consider the apparently straightforward:
    @a = sort @b;

AFAIK, in the absence of pragmas, Perl 5.6 will compare the elements
of @b character-by-character, regardless of whether the characters in
question have one- or multi-byte representations.  This behavior
should be well-defined even if some elements of @b are Unicode and
some aren't.  (I'm glad that this is the default.)
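
A minimal sketch of that default (the string values are purely
illustrative):

    # No pragmas in scope: elements are compared character by
    # character, whatever their internal representation.
    my @b = ("abc", "caf\xE9", "\x{263A}smile");  # byte and wide data mixed
    my @a = sort @b;                              # well-defined either way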

Now, from the sublime to the ridiculous:

C locales define various character characteristics.  But they are so
limited as to be very difficult to use.  More to the point, they are
defined in terms of the C type 'char', so there can be no support for
multi-byte character encodings unless your 'char' type is larger than
one byte.  Furthermore, C makes no provision for cross-locale
processing -- you can't meaningfully collate e.g. English and Hebrew.

Therefore, if C<sort> is told to use locales, it must consider all
strings to contain characters from the current (run-time) locale.
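
In code, that looks something like this (a sketch, assuming the
standard C<use locale> pragma and a locale taken from the
environment):

    use POSIX qw(setlocale LC_COLLATE);
    use locale;                       # string comparisons obey LC_COLLATE

    setlocale(LC_COLLATE, "");        # adopt the run-time (environment) locale
    @a = sort @b;                     # every element treated as characters
                                      # from that current locale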

Now, comes the key bit: It's entirely possible for characters, even
characters considered to be from the current locale, to be encoded in
a number of ways.  In other words: STRING ENCODING AND CHARACTER SET
ARE ORTHOGONAL.  So I propose that OPs' string encoding states and
character set states be orthogonal in a user-visible way.
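
A concrete illustration of that orthogonality, using e-acute (U+00E9;
the byte values are just for show):

    # The same character, two different encodings of it:
    my $as_latin1 = "\xE9";        # one byte in Latin-1
    my $as_utf8   = "\xC3\xA9";    # two bytes in UTF-8
    # Which character set those bytes denote is a separate question
    # from how many bytes each character occupies.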

(Granted, C locales are limited to one-byte characters, so anything
with ord() > 255 has no place in a locale-charset string.  But Topaz
uses C++, and C++ locales apply not only to 'char's, but also to
'wchar_t's -- wide characters.  AFAIK, there is no technical obstacle
to a C++ implementation's providing Unicode-compatible locales.
Besides, some intrepid user may want to use a non-Unicode character
set while obeying each string's current encoding.)

So: The _string_encoding_ state of each OP must be one of these:

  0. the default -- follow each string's current encoding
  1. "use byte"  -- all strings are one-byte
  2. "use utf8"  -- all strings are UTF-8 (*not* necessarily Unicode!)

And the _character_set_ state of each OP must be one of these (a
sketch of how the two states combine follows the list):

  0. the default  -- characters are Latin-1, UTF-8 is Unicode
  1. "use locale" -- characters are $ENV{LANG} (set at runtime)

If you just want to stop here, then please consider the above as a
proposed specification for the interaction of UTF-8 and locales.

{{ NEW FEATURE ALERT }}

Seeing the above list of pragmas triggers my generalization reflex.
So, how about this:

  0. C<no encoding>         == the default
  1. C<use encoding 'utf8'> == C<use utf8>
  2. C<use encoding 'byte'> == C<use byte>

Combined with this:

  0. C<no charset>           == the default
  1. C<use charset 'locale'> == C<use locale>

This interface would also provide a hook for any encodings we might
support in the future (a usage sketch follows these examples):
  use encoding 'byte2big'; == force two-byte big-endian characters,
                                without forcing their charset
or:
  use encoding 'byte4big'; == force four-byte big-endian characters
  use charset 'iso10646';  == force ISO 10646 (Unicode superset)
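
Usage under the generalized interface might look like this (every
pragma name here is part of the proposal, i.e. hypothetical):

    use encoding 'utf8';       # equivalent to "use utf8" above
    use charset  'locale';     # equivalent to "use locale" above
    @a = sort @b;              # UTF-8-encoded strings, locale character set

    # and later, if a new encoding were added:
    use encoding 'byte2big';   # two-byte big-endian, charset left alone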

So, what do you think?
-- 
Chip Salzenberg          - a.k.a. -           <[EMAIL PROTECTED]>
        "He's Mr. Big of 'Big And Tall' fame."  // MST3K
