We'll basically need 4 levels of string support:
,--[ Larry Wall ]----------------------------------------------------
| level 0  byte      == character, "use bytes" basically
| level 1  codepoint == character, what we seem to be aiming for, vaguely
| level 2  grapheme  == character, what the user usually wants
| level 3  letter    == character, what the "current language" wants
`--------------------------------------------------------------------
Yes, and I'm boldly arguing that this is the wrong way to go. I'd wager you can't find any other string or encoding library out there that takes an approach like that, or anyone asking for one. I'm eager for Larry to comment.
I'm no Larry, either :-) but I think Larry is *not* saying that the
"localeness" or "languageness" should hang off each string (or *shudder*
off each substring). What I've seen is that Larry wants the "level" to
be a lexical pragma (in Perl terms). The "abstract string" stays the
same, but the operative level decides, for _some_ ops, what a
"character" stands for.
That makes a lot of sense to me, and I'd further it by saying that levels 2 and 3 don't mean that we need to have "grapheme" or "letter" data types, per se. (If we tried to have those, we'd need properties databases to go with them, and we'd go crazy.)
For example, usually /./ means "match one Unicode code point" (a CCS
character code). But one can somehow ratchet the level up to 2 and make
it mean "match one Unicode base character, followed by zero or more
modifier characters". For level 3 the language (locale) needs to be
specified.
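Purely for illustration (Python standing in, since Perl hadn't settled on syntax for this): the level-2 reading of /./ amounts to grouping a base code point with its trailing combining marks. A minimal sketch using the stdlib unicodedata module:

```python
import unicodedata

def graphemes(s):
    """Split a string into level-2 'characters': a base code point
    followed by zero or more combining (modifier) code points."""
    clusters = []
    for ch in s:
        # combining() is nonzero for combining marks; attach each
        # mark to the preceding base character's cluster.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "a" + COMBINING ACUTE ACCENT + "b": three code points (level 1),
# but only two characters at level 2.
s = "a\u0301b"
print(len(s), len(graphemes(s)))  # 3 2
```

(Real grapheme segmentation is more involved than this, of course -- which is exactly why you'd want it in the pragma machinery, not in a "grapheme" data type.)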
Another example could be that at level 2 (and 3), maybe "eq" automatically normalizes before doing string comparisons, and at levels 1 and 0 it doesn't.
(If Larry is really saying that the "locale" should be an attribute of the string value, I'm on the barricades with you, holding cobblestones and Molotov cocktails...)
It's nice to have company!
JEff