Chip Salzenberg writes:
: So: The _string_encoding_ state of each OP must be one of these:
:   0. the default -- follow each string's current encoding
:   1. "use byte"  -- all strings are one-byte
:   2. "use utf8"  -- all strings are UTF-8 (*not* necessarily Unicode!)

There is no 2.
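The byte-versus-character distinction being argued over here shows up in any language with encoding-aware strings; a quick Python analogy (illustrative only, not Perl semantics):

```python
# One string, two lengths: four characters, but five bytes once
# encoded, because "é" takes two bytes in UTF-8.  This is the gap
# between "use bytes"-style and character-oriented semantics.
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes
```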

: And the _character_set_ state of each OP must be one of these:
:   0. the default  -- characters are Latin-1, UTF-8 is Unicode
:   1. "use locale" -- characters are $ENV{LANG} (set at runtime)

I would actually like to avoid locales if at all possible.  They are
not the right approach to sorting or much of anything else.  I recommend
the Unicode Consortium reports for a thorough discussion of what's needed
at the higher levels of abstraction.

: Seeing the above list of pragmas triggers my generalization reflex.
: So, how about this:
:   0. C<no encoding>         == the default
:   1. C<use encoding 'utf8'> == C<use utf8>

Again, that doesn't seem to do what you think anymore.

:   2. C<use encoding 'byte'> == C<use byte>
: Combined with this:
:   0. C<no charset>           == the default
:   1. C<use charset 'locale'> == C<use locale>

"Use locale!?!  Slowly I turned...step by step...inch by inch..."

: This interface would also provide a hook for any encodings we might
: support in future:
:   use encoding 'byte2big'; == force two-byte big-endian characters,
:                                 without forcing their charset
: or:
:   use encoding 'byte4big'; == force four-byte big-endian characters
:   use charset 'iso10646';  == force ISO 10646 (Unicode superset)

Not really a superset anymore, unless you're into defining your own
characters outside of U+10FFFF.

: So, what do you think?

Nothing we are doing precludes doing that eventually, should we happen
to find it interesting some day, which does not seem to be today.
UTF-8 is taking over the world, and is quite capable of representing
all of ISO-10646 already.  It's also already capable of representing
other character sets with not much tweaking.
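That coverage claim is easy to check concretely; a small Python sketch (again just an analogy, not the Perl internals under discussion):

```python
# UTF-8 reaches every code point up to the U+10FFFF ceiling,
# using at most four bytes per character.
top = chr(0x10FFFF)
print(len(top.encode("utf-8")))  # 4 bytes for the very last code point
```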

As for other encodings, I'm just not terribly interested in rewriting
all the opcodes to support them all simultaneously.  That's what
Unicode is supposed to be getting us away from, after all.

Earlier I indicated that mostly only Asians will be interested in "OEM"
character sets, but I have to back up a bit and admit that I could be
wrong about that.  It's possible we might apply the same legacy character
set processing to handling I/O channels that want a "legacy" of UTF-16
or UCS-4.

I think if we ever do support fixed-width wide characters in Perl
internally, we might just jump straight to 32 bits.  I'd love to forget
all that characters-fit-in-16-bits-except-when-they-don't crapola.  Not
to mention the fact that some character codes can't be encoded in it
because they're reserved for half a surrogate character, so it's
entirely Unicode specific.
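Both halves of that complaint can be demonstrated; a Python illustration (an analogy for the UTF-16 behavior described, not Perl code):

```python
# Code points U+D800-U+DFFF are reserved as surrogate halves and
# cannot be encoded as characters in their own right.
try:
    "\ud800".encode("utf-16-le")
    print("encoded")
except UnicodeEncodeError:
    print("lone surrogate rejected")

# And anything above U+FFFF needs a surrogate pair: four bytes,
# not the "fixed" two.
print(len(chr(0x10000).encode("utf-16-le")))  # 4
```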

The more I see of UTF-16 the better I dislike it.  BOMs away...

[Tell us what you really think, Larry.]
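The BOM gripe in concrete terms: a two-byte encoding serializes differently per byte order, so UTF-16 streams carry a byte-order mark, while UTF-8 has only one byte order and needs none. A Python illustration:

```python
# The same character, two possible byte orders in UTF-16...
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
# ...so plain UTF-16 prepends a BOM to disambiguate.
print("A".encode("utf-16"))     # BOM bytes, then the character
# UTF-8 is byte-order-free: no BOM required.
print("A".encode("utf-8"))      # b'A'
```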

At any rate, in the unlikely event that we do ever go with fixed-width
characters internally, I suspect we'll try to make it as transparent as
possible, like we're doing now with UTF-8.  But I really think the
interfaces are where the battle will be fought, and I think UTF-8 will
prevail there in the long run, despite the early example of Java.
Linux is going to be all UTF-8, and between Linux and Java I think Java
will find itself being wagged.
