At 11:57 AM 9/20/2001 -0400, Guido van Rossum wrote:
> > It'll probably be more like:
> >
> >     void *name = Get_String_Data(p1, PARROT_UNICODE, encoding);
> >
> > And yes, the void * is deliberate (though subject to change) since I'm
> > being generic--how do you know that you're getting back a series of bytes?
> > encoding might have the UTF_32 constant in it.
>
>Isn't *everything* a series of bytes? :-)

Don't *make* me dig out the TOPS-20 machines. You really don't want that... :)

Seriously, if the string data is in UTF-16 it ought to be treated as a 
series of 16-bit integers, or 32-bit integers if it's UTF-32. IIRC 
Shift-JIS and Big-5 are also 16-bit quantities, but my references are at 
home at the moment so I'm not sure. (I know they encode down to a 
bytestream, but internally I'd just as soon leave them as fixed-width 
characters of the appropriate size, since it makes life easier.)

> > For just fetching the abstract string structure it'd be more like:
> >
> >    Parrot_string *name = Get_String(p1);
> >
> > FWIW, anyone using char * in the Parrot source in areas that do not
> > directly involve interface with the external world (system calls, other
> > people's libraries) will find themselves on the other end of the Big 
> Mallet
> > of Programmer Chastisement. :-)
>
>Sure, but please explain.

That's sorta vague, so I'll use it as an opportunity to launch into my big 
"Why string data sucks, and what Parrot's doing to deal with it" speech. 
(Keep telling yourself it was your own fault... :)

String data, generally speaking, has the following characteristics:

    A series of code points
    A character set (ASCII, EBCDIC, Unicode, whatever)
    An encoding (UTF-8, UTF-16, 32-bit integers)
    A language
    Length in bytes
    Length in code points
    Length in glyphs

The string structure in Parrot holds all of that, and the routines that 
manipulate strings use it.

Now, the first question I always get when rolling this out is "Why? What's 
wrong with Unicode?" (By which people usually mean UTF-8, if they even 
realize Unicode has several ways to represent the abstract code points.) And 
the answer is that Unicode conversion is both lossy and unnecessarily 
slow in many cases.

The slow part's the easier one to deal with. If my native data is, say, 
EBCDIC, and 99% of what I deal with is EBCDIC, why the heck should I convert? 
The same goes if my native data is Shift-JIS, or Big-5, or GB-encoded 
simplified Chinese, or even ASCII with the top half filled with national 
characters (Greek, French, Romanian, whatever). Waste of time, and I object 
to wasting time.

Unicode as a default only really makes sense (to me at least) from a 
US-ASCII standpoint, since it's free there. The people who'd actually *use* 
it heavily have other solutions in place already that work fine. (If they 
want to use Unicode I'm OK with it, but who the heck am I to tell a billion 
plus Chinese that I have a better solution for encoding their language than 
they do? I don't even *speak* Chinese.)

The lossy part's a separate issue with a separate set of problems, and I'm 
not even talking about characters not in the Unicode character set. When 
you transcode to Unicode you lose some important data, specifically the 
language that the string originally came from.

As an extreme example, let's take Chinese and Japanese. The Japanese kanji 
characters overlap the Chinese character range. That makes sense, since the 
glyphs are essentially the same. (We're dodging a bunch of 
interpretation/language/image issues here--that's the Unicode Consortium's 
problem.) Unfortunately, Japanese and Chinese sort and compare their 
characters differently. If we're pure Unicode there's no way to tell 
whether one string is greater than another, or how to sort the strings. We 
could use the generic Unicode algorithm, but as far as I can tell that's 
wrong for *everyone*. I'd rather be right for everyone, or at least for 
some people.

I've also been told that the problem even exists in Western European 
languages--some languages consider accented (or umlauted, or tilde'd, or 
whatever) characters different from the un-accented version, and some 
don't. And in some cases two different languages will sort the same mix of 
accented and unaccented characters differently. (I can't pull an example 
out of the air at the moment, so I might be wrong here. I'm not familiar 
with the character sorting schemes for all the languages in Western Europe, 
so I'm taking this on faith)

Anyway, there you go. To completely represent a string you need lots of 
parts. The reference for the bits is:

a series of code points: Gotta have the raw data

A character set: It helps to know what character 12 actually *is*

An encoding: So we can pick characters out of the raw data. (Helps to know 
how big a character is...)

A language: So we can properly interpret the data in those cases where we 
care, generally for comparison and sorting

Length in bytes: So we know how much raw data we have

Length in code points: So character-level operations (indexing, iteration) 
can work per character instead of rescanning the raw bytes

Length in Glyphs: This one's still up in the air (we might not do it), but 
it's nice in those cases where multiple code points collapse into a single 
glyph on-screen


Make sense? Parrot's set up so that the libraries to handle a particular 
kind of data (EBCDIC, Unicode, Shift-JIS, Big5/traditional, Finnish ASCII) 
will be dynamically loadable, so we can add them after the fact and you 
don't have to pay the memory price for ones you never use.

We will, FWIW, transcode to Unicode in those cases where we have to deal 
with data in multiple encodings and shouldn't just throw an error. While 
lowest-common-denominator solutions are bad, they're better than nothing...

For those folks who've made it this far and are starting (or continuing) to 
froth over efficiency, I'll point out that for most of the string work that 
the interpreter (any interpreter) needs to do it can, if things match (and 
we'll make sure they do), treat the character data as a stream of n-byte 
characters. When doing an exact string match with the regex engine, for 
example, it doesn't really care what a character means as long as it's the 
same. And sets of characters (word, digit, whitespace, whatever) are just 
sets of characters--as long as it's got the set that matches the encoding 
of the RE and the string to be searched, it's happy. They're all just a 
bunch of bits after all.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
