OK, I think I get 'use byte' vs. 'use utf8' vs. neither. But, there
are still locales. To illustrate the issues, let me get concrete here
for a minute. (Please, no construction jokes.)
Consider the apparently straightforward:
@a = sort @b;
AFAIK, in the absence of pragmas, Perl 5.6 will compare the elements
of @b character-by-character, regardless of whether the characters in
question have one- or multi-byte representations. This behavior
should be well-defined even if some elements of @b are Unicode and
some aren't. (I'm glad that this is the default.)
Now, from the sublime to the ridiculous:
C locales define various character characteristics. But they are so
limited as to be very difficult to use. More to the point, they are
defined in terms of the C type 'char', so there can be no support for
multi-byte character encodings unless your 'char' type is larger than
one byte. Furthermore, C makes no provision for cross-locale
processing -- you can't meaningfully collate e.g. English and Hebrew.
Therefore, if C<sort> is told to use locales, it must consider all
strings to contain characters from the current (run-time) locale.
Now comes the key bit: It's entirely possible for characters, even
characters considered to be from the current locale, to be encoded in
a number of ways. In other words: STRING ENCODING AND CHARACTER SET
ARE ORTHOGONAL. So I propose that OPs' string encoding states and
character set states be orthogonal in a user-visible way.
(Granted, C locales are limited to one-byte characters, so anything
with ord() > 255 has no place in a locale-charset string. But Topaz
uses C++, and C++ locales apply not only to 'char's, but also to
'wchar_t's -- wide characters. AFAIK, there is no technical obstacle
to a C++ implementation's providing Unicode-compatible locales.
Besides, some intrepid user may want to use a non-Unicode character
set while obeying each string's current encoding.)
So: The _string_encoding_ state of each OP must be one of these:
0. the default -- follow each string's current encoding
1. "use byte" -- all strings are one-byte
2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)
And the _character_set_ state of each OP must be one of these:
0. the default -- characters are Latin-1, UTF-8 is Unicode
1. "use locale" -- characters are $ENV{LANG} (set at runtime)
If you just want to stop here, then please consider the above as a
proposed specification for the interaction of UTF-8 and locales.
{{ NEW FEATURE ALERT }}
Seeing the above list of pragmas triggers my generalization reflex.
So, how about this:
0. C<no encoding> == the default
1. C<use encoding 'utf8'> == C<use utf8>
2. C<use encoding 'byte'> == C<use byte>
Combined with this:
0. C<no charset> == the default
1. C<use charset 'locale'> == C<use locale>
This interface would also provide a hook for any encodings we might
support in future:
use encoding 'byte2big'; == force two-byte big-endian characters,
without forcing their charset
or:
use encoding 'byte4big'; == force four-byte big-endian characters
use charset 'iso10646'; == force ISO 10646 (Unicode superset)
So, what do you think?
--
Chip Salzenberg - a.k.a. - <[EMAIL PROTECTED]>
"He's Mr. Big of 'Big And Tall' fame." // MST3K