At 11:57 AM 9/20/2001 -0400, Guido van Rossum wrote:
> > It'll probably be more like:
> >
> > void *name = Get_String_Data(p1, PARROT_UNICODE, encoding);
> >
> > And yes, the void * is deliberate (though subject to change) since I'm
> > being generic--how do you know that you're getting back a series of bytes?
> > encoding might have the UTF_32 constant in it.
>
>Isn't *everything* a series of bytes? :-)
Don't *make* me dig out the TOPS-20 machines. You really don't want that... :)
Seriously, if the string data is in UTF-16 it ought to be treated as a
series of 16-bit integers, or 32-bit integers if it's UTF-32. IIRC
Shift-JIS and Big-5 are also 16-bit quantities, but my references are at
home at the moment so I'm not sure. (I know they encode down to a
bytestream, but internally I'd as soon leave them as fixed-width characters
of the appropriate size, as it makes life easier.)
> > For just fetching the abstract string structure it'd be more like:
> >
> > Parrot_string *name = Get_String(p1);
> >
> > FWIW, anyone using char * in the Parrot source in areas that do not
> > directly involve interface with the external world (system calls, other
> > people's libraries) will find themselves on the other end of the Big
> Mallet
> > of Programmer Chastisement. :-)
>
>Sure, but please explain.
That's sorta vague, so I'll use it as an opportunity to launch into my big
"Why string data sucks, and what Parrot's doing to deal with it" speech.
(Keep telling yourself it was your own fault... :)
String data, generally speaking, has the following characteristics:
A series of code points
A character set (ASCII, EBCDIC, Unicode, whatever)
An encoding (UTF-8, UTF-16, 32-bit integers)
A language
Length in bytes
Length in code points
Length in glyphs
The string structure in Parrot holds all that stuff, and the routines that
manipulate strings use it.
Now, the first question I always get when rolling this out is "Why? What's
wrong with Unicode?" (By which people usually mean UTF-8, if they even
realize Unicode has several ways to represent the abstract code points.)
And the answer is that Unicode conversion is both lossy and unnecessarily
slow in many cases.
The slow part's the easier to deal with. If my native data is, say, EBCDIC,
and 99% of what I deal with is EBCDIC, why the heck should I convert? The
same goes if my native data is Shift-JIS, or Big-5 traditional, or Big-5
simplified, or even ASCII with the top half filled with national characters
(Greek, French, Romanian, whatever). Waste of time, and I object to wasting
time.
Unicode as a default only really makes sense (to me at least) from a
US-ASCII standpoint, since it's free there. The people who'd actually *use*
it heavily have other solutions in place already that work fine. (If they
want to use Unicode I'm OK with it, but who the heck am I to tell a billion
plus Chinese that I have a better solution for encoding their language than
they do? I don't even *speak* Chinese.)
The lossy part's a separate issue with a separate set of problems, and I'm
not even talking about characters not in the Unicode character set. When
you transcode to Unicode you lose some important data, specifically the
language that the string originally came from.
As an extreme example, let's take Chinese and Japanese. The Japanese kanji
characters overlap the Chinese character range. That makes sense, since the
glyphs are essentially the same. (We're dodging a bunch of
interpretation/language/image issues here--that's the Unicode Consortium's
problem) Unfortunately the Japanese and Chinese sort and compare their
characters differently. If we're pure Unicode there's no way to tell
whether one string is greater than another, or how to sort the strings. We
could use the generic Unicode algorithm, but as far as I can tell that's
wrong for *everyone*. I'd rather be right for everyone, or at least for
some people.
I've also been told that the problem even exists in Western European
languages--some languages consider accented (or umlauted, or tilde'd, or
whatever) characters different from the un-accented version, and some
don't. And in some cases two different languages will sort the same mix of
accented and unaccented characters differently. (I can't pull an example
out of the air at the moment, so I might be wrong here. I'm not familiar
with the character sorting schemes for all the languages in Western Europe,
so I'm taking this on faith)
Anyway, there you go. To completely represent a string you need lots of
parts. The reference for the bits is:
A series of code points: Gotta have the raw data
A character set: It helps to know what character 12 actually *is*
An encoding: So we can pick characters out of the raw data. (Helps to know
how big a character is...)
A language: So we can properly interpret the data in those cases where we
care, generally for comparison and sorting
Length in bytes: So we know how much raw data we have
Length in code points: Because knowing this is nice too
Length in glyphs: This one's still up in the air (we might not do it), but
it's nice in those cases where multiple code points collapse into a single
glyph on-screen
Make sense? Parrot's set up such that the libraries to handle a particular
kind of data (EBCDIC, Unicode, Shift-JIS, Big5/traditional, Finnish ASCII)
will be dynamically loadable so we can add them after the fact and you
don't have to pay the memory price.
We will, FWIW, transcode to Unicode in those cases where we have to deal
with data in multiple encodings and shouldn't just throw an error. While
lowest common denominators are bad, they're better than nothing...
For those folks who've made it this far and are starting (or continuing) to
froth over efficiency, I'll point out that for most of the string work that
the interpreter (any interpreter) needs to do it can, if things match (and
we'll make sure they do), treat the character data as a stream of n-byte
characters. When doing an exact string match with the regex engine, for
example, it doesn't really care what a character means as long as it's the
same. And sets of characters (word, digit, whitespace, whatever) are just
sets of characters--as long as it's got the set that matches the encoding
of the RE and the string to be searched, it's happy. They're all just a
bunch of bits after all.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk