On Sat, Mar 19, 2005 at 05:07:49PM -0600, Rod Adams wrote:
: I propose that we make a few decisions about strings in Perl. I've read
: all the synopses, several list threads on the topic, and a few web
: guides to Unicode. I've also thought a lot about how to cleanly define
: all the string related functions that we expect Perl to have in the face
: of all this expanded Unicode support.
:
: What I've come up with is that we need a rule that says:
:
: A single string value has a single encoding and a single Unicode Level
: associated with it, and you can only talk to that value on its own
: terms. These will be the properties "encoding" and "level".
You've more or less described the semantics available at the "use bytes"
level, which basically comes down to a pure OO approach where the user
has to be aware of all the types (to the extent that OO doesn't hide
that). It's one approach to polymorphism, but I think it shortchanges the
natural polymorphism of Unicode, and the approach of Perl to such natural
polymorphisms, as is evident in its autoconversion between numbers and
strings. That being said, I don't think your view is so far off my view.
More on that below.

: However, it should be easy to coerce that string into something that
: behaves some other way. The question is, "how easy?"

You're proposing a mechanism that, frankly, looks rather intrusive and
makes my eyes glaze over as a representative of the Pooh clan. I think
the typical user would rather have at least the option of automatic
coercion in a lexical scope.

But let me back up a bit. What I want to do is just widen your definition
of a string type slightly; I see your current view as a sort of
degenerate case of my view. Instead of viewing a string as having an
exact Unicode level, I prefer to think of it as having a natural maximum
and minimum level when it's born, depending on the type of data it's
trying to represent. A memory buffer naturally has a minimum and maximum
Unicode level of "bytes". A typical Unicode string encoded in, say,
UTF-8 has a minimum Unicode level of bytes, and a maximum of "chars"
(I'm using that to represent language-dependent graphemes here). A
Unicode string revealed by an abstract interface might not allow any
bytes-level view, but use codepoints, or even graphemes, as its natural
minimum, while still allowing any view up to chars, as long as it doesn't
go below codepoints.

A given lexical scope chooses a default Unicode view, which can be
naturally mapped for any data type that allows that view. The question is
what to do outside of that range.
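One minimal way to picture this typed view is the following sketch. This is
illustrative Python only, not any proposed Perl 6 API; the `Level` names and
the `Str` wrapper are my own invention for the sake of the example:

```python
from enum import IntEnum

# Illustrative names for the Unicode levels, ordered from lowest to
# highest abstraction ("chars" = language-dependent graphemes).
class Level(IntEnum):
    BYTES = 0
    CODEPOINTS = 1
    GRAPHEMES = 2
    CHARS = 3

# A string value is born with a natural [min, max] range of levels,
# depending on the kind of data it represents.
class Str:
    def __init__(self, data, min_level, max_level):
        assert min_level <= max_level
        self.data = data
        self.min_level = min_level
        self.max_level = max_level

# A memory buffer: min == max == bytes.
buffer = Str(b"\xc3\xa9", Level.BYTES, Level.BYTES)

# A typical UTF-8 string: any view from bytes up to chars is allowed.
utf8 = Str(b"\xc3\xa9", Level.BYTES, Level.CHARS)

# A string from an abstract interface: no bytes view at all,
# codepoints as the natural minimum.
abstract = Str("\u00e9", Level.CODEPOINTS, Level.CHARS)
```

The point of carrying a range rather than a single level is that the lexical
scope's default view can then be matched against it, rather than forcing an
exact type on every operation.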
(Inside that range, I suspect we can arrange to find a version of
index($str,$targ) that works even if $str and $targ aren't the same exact
type, preferably one that works at the current Unicode level. I think the
typical user would prefer that we find such a function for him without him
having to play with coercions.)

If the current lexical view is outside the range allowed by the current
string, I think the default behavior is different looking up than looking
down. If I'm working at the chars level, then everything looks like
chars, even if it's something smaller. To take an extreme case, suppose I
do a chop on a string that allows the byte view as its highest level,
that is, a byte buffer. I always get the last byte of the string, even if
the data could conceivably be interpreted as some other encoding. For
that string, the bytes *are* the characters. They're also the codepoints,
and the graphemes. Likewise, a string whose maximum is codepoints will
behave like a codepoint buffer even under higher-level views. This seems
very dwimmy to me.

Going the other way, if a lower-level view tries to access a string whose
minimum is a higher level, it's just illegal. In a bytes lexical context,
it will force you to be more specific about what you mean if you want to
do an operation on a string that requires a higher level of abstraction.

As a limiting case, if you force all your incoming strings to minimum ==
maximum, and write your code at the bytes level, this degenerates to your
proposed semantics, more or less. I don't doubt that many folks would
prefer to program at this explicit level where all the polymorphism is
supplied by the objects, but I also think a lot of folks would prefer to
think at the graphemes or chars level by default. It's the natural human
way of chunking text.

I know this view of string polymorphism makes a bit more work for us, but
it's one of the basic Perl ideals to try to do a lot of vicarious work in
advance on behalf of the user.
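That up/down asymmetry reduces to a single resolution rule, which can be
sketched as follows. Again, this is illustrative Python with invented names
(`resolve_view`, the level constants), not Perl 6 syntax:

```python
# Levels ordered from lowest to highest abstraction (illustrative names).
BYTES, CODEPOINTS, GRAPHEMES, CHARS = range(4)

def resolve_view(lexical_level, str_min, str_max):
    """Pick the level an operation actually runs at, given the lexical
    scope's default view and the string's allowed [min, max] range."""
    if lexical_level > str_max:
        # Looking down from a higher view is fine: for a byte buffer,
        # the bytes *are* the chars, so chop yields the last byte.
        return str_max
    if lexical_level < str_min:
        # Looking up from below the minimum is illegal; the caller
        # must be explicit about what is meant.
        raise TypeError("string does not allow a view this low")
    return lexical_level

# chop at the chars level on a byte buffer (min == max == bytes)
# operates on bytes:
assert resolve_view(CHARS, BYTES, BYTES) == BYTES
```

In the limiting case where every string has min == max and the lexical view
is bytes, this rule either returns bytes or raises, which matches the
degenerate, fully explicit semantics described above.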
That was Perl's added value over other languages when it started out, both
on the level of mad configuration and on the level of automatic
str/num/int polymorphism. I think Perl 6 can do the same on the level of
Str polymorphism.

When it comes to Unicode, most other OO languages are falling into the
Lisp trap of expecting the user to think like the computer rather than
making the computer think like the user. That's one of the few ideas from
Lisp I'm trying very hard *not* to steal.

Larry