On Sat, Mar 19, 2005 at 05:07:49PM -0600, Rod Adams wrote:
: I propose that we make a few decisions about strings in Perl. I've read
: all the synopses, several list threads on the topic, and a few web
: guides to Unicode. I've also thought a lot about how to cleanly define
: all the string related functions that we expect Perl to have in the face
: of all this expanded Unicode support.
: 
: What I've come up with is that we need a rule that says:
: 
: A single string value has a single encoding and a single Unicode Level
: associated with it, and you can only talk to that value on its own
: terms. These will be the properties "encoding" and "level".

You've more or less described the semantics available at the "use
bytes" level, which basically comes down to a pure OO approach where
the user has to be aware of all the types (to the extent that OO
doesn't hide that).  It's one approach to polymorphism, but I think
it shortchanges the natural polymorphism of Unicode, and the approach
of Perl to such natural polymorphisms as evident in autoconversion
between numbers and strings.  That being said, I don't think your
view is so far off my view.  More on that below.

: However, it should be easy to coerce that string into something that
: behaves some other way.

The question is, "how easy?"  You're proposing a mechanism that,
frankly, looks rather intrusive and makes my eyes glaze over as a
representative of the Pooh clan.  I think the typical user would rather
have at least the option of automatic coercion in a lexical scope.

But let me back up a bit.  What I want to do is to just widen your
definition of a string type slightly.  I see your current view as a
sort of degenerate case of my view.  Instead of viewing a string as
having an exact Unicode level, I prefer to think of it as having a
natural maximum and minimum level when it's born, depending on the
type of data it's trying to represent.  A memory buffer naturally has
a minimum and maximum Unicode level of "bytes".  A typical Unicode
string encoded in, say, UTF-8, has a minimum Unicode level of bytes,
and a maximum of "chars" (I'm using that to represent language-dependent
graphemes here).  A Unicode string revealed by an abstract interface
might not allow any bytes-level view, but use codepoints for the
natural minimum, or even graphemes, but still allow any view up
to chars, as long as it doesn't go below codepoints.
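
To make those levels concrete, here's a minimal sketch of how the same
data might answer differently at each view (the particular names here
-- .chars, .codes, .encode, .bytes, Buf -- are illustrative assumptions,
not settled API):

    # One logical string, three views (method names assumed for illustration)
    my $str = "g\x[0308]";              # 'g' plus COMBINING DIAERESIS
    say $str.chars;                     # 1 -- one grapheme (a "char")
    say $str.codes;                     # 2 -- two codepoints
    say $str.encode('UTF-8').bytes;     # 3 -- three bytes once encoded

    # A raw memory buffer never offers anything above the bytes view:
    my $buf = Buf.new(0x63, 0x61, 0x66, 0xC3, 0xA9);   # "café" as UTF-8 bytes
    say $buf.bytes;                     # 5 -- bytes are all it has
    say $buf.decode('utf-8').chars;     # 4 -- explicit coercion up to graphemes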

A given lexical scope chooses a default Unicode view, which can be
naturally mapped for any data types that allow that view.  The question
is what to do outside of that range.  (Inside that range, I suspect
we can arrange to find a version of index($str,$targ) that works
even if $str and $targ aren't the same exact type, preferably one
that works at the current Unicode level.  I think the typical user
would prefer that we find such a function for him without him having
to play with coercions.)
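
For instance (a sketch only, with the routine names assumed), an index
that answers in units of whatever view is current:

    my $str  = "naïve café";
    my $targ = "café";

    # At the chars/graphemes view, index counts graphemes:
    my $i = index($str, $targ);                                # 6
    # The same position measured at the bytes view differs,
    # since "ï" occupies two bytes in UTF-8:
    my $byte-pos = $str.substr(0, $i).encode('UTF-8').bytes;   # 7
    say "$i chars in, $byte-pos bytes in";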

If the current lexical view is outside the range allowed by the
current string, I think the default behavior is different looking up than
down.  If I'm working at the chars level, then everything looks like
chars, even if it's something smaller.  To take an extreme case,
suppose I do a chop on a string that allows the byte view as the
highest level, that is, a byte buffer.  I always get the last byte
of the string, even if the data could conceivably be interpreted as
some other encoding.  For that string, the bytes *are* the characters.
They're also the codepoints, and the graphemes.  Likewise, a string
that is max codepoints will behave like a codepoint buffer even under
higher levels.  This seems very dwimmy to me.
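
In code, the contrast might look like this (a sketch; Buf, subbuf, and
chop as stand-ins for whatever we end up with):

    # A byte buffer: its bytes *are* its characters, so chop takes one byte,
    # even if that byte is half of what some other view would call "é".
    my $buf = Buf.new(0x63, 0x61, 0x66, 0xC3, 0xA9);    # "café" as UTF-8
    my $shorter = $buf.subbuf(0, $buf.elems - 1);        # drops just the 0xA9

    # A chars-level string: chop takes one whole grapheme,
    # however many codepoints or bytes it happens to span.
    my $str = "cafe\x[0301]";        # "café" with a combining acute
    say $str.chop;                   # "caf" -- the e-plus-accent goes as a unit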

Going the other way, if a lower-level view tries to access a string whose
minimum is a higher level, it's just illegal.  In a bytes lexical context,
it will force you to be more specific about what you mean if you want
to do an operation on a string that requires a higher level of abstraction.
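
Roughly (again, just a sketch of the flavor):

    my $buf = Buf.new(0x63, 0x61, 0x66, 0xC3, 0xA9);

    # Quietly treating raw bytes as text is refused...
    say try { ~$buf } // 'refused';       # refused
    # ...until you say explicitly what you mean:
    say $buf.decode('utf-8').chars;       # 4 -- "these bytes are UTF-8 text"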

As a limiting case, if you force all your incoming strings to be
minimum == maximum, and write your code at the bytes level, this
degenerates to your proposed semantics, more or less.  I don't doubt
that many folks would prefer to program at this explicit level where
all the polymorphism is supplied by the objects, but I also think a
lot of folks would prefer to think at the graphemes or chars level
by default.  It's the natural human way of chunking text.

I know this view of string polymorphism makes a bit more work for us,
but it's one of the basic Perl ideals to try to do a lot of vicarious
work in advance on behalf of the user.  That was Perl's added value
over other languages when it started out, both on the level of mad
configuration and on the level of automatic str/num/int polymorphism.
I think Perl 6 can do this on the level of Str polymorphism.
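
(The earlier analogy in code form: the str/num polymorphism just works,
and the hope is that Str can do likewise across Unicode levels.)

    say "42" + 1;      # 43   -- a string quietly used as a number
    say 42 ~ "nd";     # 42nd -- a number quietly used as a string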

When it comes to Unicode, most other OO languages are falling into
the Lisp trap of expecting the user to think like the computer rather
than the computer like the user.  That's one of the few ideas from
Lisp I'm trying very hard *not* to steal.

Larry
