Strings, charsets, and encodings, oh my!

Dan Sugalski Thu, 11 Nov 2004 08:42:51 -0800

Or something like that.

Anyway, I'm nailing down the last bits of functionality for the changes to the string system. There's still going to be a fair amount of cleanup (including the eradication of some globals) once this is in and merged, but I wanted to give folks a heads up, and a refresher on the scheme going in.

We're going to continue the parrot tradition of confusing naming, referring to any length-delimited wad of bytes as a string. They go in string registers, PMCs can hold 'em, they're maintained internally with the STRING structure, and so on. (This is true and not going to change, regardless of whether it's a good idea or not) Each string has attached to it an encoding and a charset.

Strings are a sequence of graphemes. A grapheme is the smallest logical unit of text. We'd call 'em characters, except there are issues there with typography so we're not going to. A grapheme is composed of one or more code points. And a code point is a 32-bit integer.
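To make the grapheme/code point split concrete, here's a rough sketch. It is not Parrot code: `simple_graphemes` is a made-up helper, and the rule it uses (a base code point plus any combining marks that follow) is a deliberate simplification of real Unicode segmentation.

```python
import unicodedata


def simple_graphemes(code_points):
    """Group code points into graphemes.

    Simplifying assumption: a grapheme is a base code point followed by
    zero or more combining marks. Real segmentation rules are richer.
    """
    graphemes = []
    for cp in code_points:
        # A nonzero combining class means "attach me to what came before".
        if graphemes and unicodedata.combining(chr(cp)) != 0:
            graphemes[-1].append(cp)
        else:
            graphemes.append([cp])
    return graphemes


# "e" + U+0301 COMBINING ACUTE ACCENT, then "x": three code points,
# but only two graphemes.
code_points = [0x65, 0x301, 0x78]
graphemes = simple_graphemes(code_points)
```

Here `graphemes` holds two entries: one made of two code points and one made of a single code point.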

The encoding code is responsible for managing the underlying byte buffer. It's the layer that translates between code points and real bytes, making the buffer *look* like a contiguous sequence of 32-bit integers, even if it really isn't. (It won't be if, for example, the buffer is UTF-8 data, where a code point takes anywhere from 1 to 4 bytes under the current standard and up to 6 under the original definition, or if the buffer is sparse, or zip/gzip/bzip compressed.)
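A minimal sketch of what "making the buffer look like a sequence of integers" might mean. The class and method names (`Fixed8Encoding`, `Utf8Encoding`, `length`, `code_point_at`) are invented for illustration and are not the actual encoding API; the UTF-8 scan handles only the 1 to 4 byte forms and does no real validation.

```python
class Fixed8Encoding:
    """One byte per code point, so indexing is trivial."""

    def length(self, buf):
        return len(buf)

    def code_point_at(self, buf, index):
        return buf[index]


class Utf8Encoding:
    """Variable-width: must walk the buffer to find code point N."""

    @staticmethod
    def _width(lead):
        # Width of a sequence, judged from its lead byte alone.
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        if lead >> 3 == 0b11110:
            return 4
        raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)

    def length(self, buf):
        count = offset = 0
        while offset < len(buf):
            offset += self._width(buf[offset])
            count += 1
        return count

    def code_point_at(self, buf, index):
        offset = 0
        for _ in range(index):
            offset += self._width(buf[offset])
        width = self._width(buf[offset])
        return ord(buf[offset:offset + width].decode("utf-8"))


# "a", e-acute, euro sign: 3 code points stored in 6 bytes.
text = "a\u00e9\u20ac".encode("utf-8")
utf8 = Utf8Encoding()
third = utf8.code_point_at(text, 2)
```

Both classes answer the same two questions, so whatever sits above them doesn't need to know which one it's talking to.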

The charset code is responsible for managing the graphemes in a string, translating between graphemes and code points, giving basic meaning to graphemes, and doing basic manipulation of the graphemes. In this case basic meaning is classification (is this grapheme a whitespace/alpha/numeric/punctuation/line break character?), and basic manipulation is case changes, insertions, and deletions.
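For a feel of what "basic meaning" and "basic manipulation" could look like, here's a sketch of an ASCII charset, where every grapheme is exactly one code point. The class and method names are hypothetical, not the real charset interface, and which code points count as whitespace or line breaks is my own choice.

```python
class AsciiCharset:
    """Classification and case changes for ASCII code points (sketch)."""

    def is_linebreak(self, cp):
        return cp in (0x0A, 0x0D)

    def is_whitespace(self, cp):
        # Line breaks are treated as whitespace too in this sketch.
        return cp in (0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20)

    def is_numeric(self, cp):
        return 0x30 <= cp <= 0x39

    def is_alpha(self, cp):
        return 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A

    def is_punctuation(self, cp):
        printable = 0x21 <= cp <= 0x7E
        return printable and not (self.is_alpha(cp) or self.is_numeric(cp))

    def upcase(self, cp):
        # Only a-z change; everything else passes through untouched.
        return cp - 0x20 if 0x61 <= cp <= 0x7A else cp


ascii_cs = AsciiCharset()
shouted = [ascii_cs.upcase(cp) for cp in b"oh my!"]
```

A different charset would answer the same questions differently, which is the whole reason these live in the charset layer.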

A picture looks something like:

     Parrot string ops
             |
             |
             v
       charset code
             |
             |
             v
       encoding code
             |
             |
             v
         raw data

So your parrot string ops (and C API calls) always talk to a string's charset code, which then will talk to the encoding code (maybe--it's OK for this code to cheat if it knows it's OK), which then dives into the actual buffer data. Parrot string ops never go past the charset.
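The layering in the picture might be sketched like this. Everything here (`SketchString`, `op_upcase`, and the method names) is invented for illustration; the real thing is C and the STRING structure, and its fields are not spelled out in this message.

```python
from dataclasses import dataclass


class Fixed8Encoding:
    """Bottom layer: the only code that touches raw bytes."""

    def length(self, buf):
        return len(buf)

    def code_point_at(self, buf, index):
        return buf[index]

    def from_code_points(self, code_points):
        return bytes(code_points)


class AsciiCharset:
    """Middle layer: gives code points meaning, asks the encoding for them."""

    def upcase(self, s):
        enc = s.encoding
        cps = [enc.code_point_at(s.buffer, i) for i in range(enc.length(s.buffer))]
        upper = [cp - 0x20 if 0x61 <= cp <= 0x7A else cp for cp in cps]
        return SketchString(enc.from_code_points(upper), enc, self)


@dataclass
class SketchString:
    """Hypothetical stand-in for a string: a buffer plus its two handlers."""
    buffer: bytes
    encoding: object
    charset: object


def op_upcase(s):
    # Top layer: a string op talks to the charset and nothing below it.
    return s.charset.upcase(s)


greeting = SketchString(b"hello, world", Fixed8Encoding(), AsciiCharset())
result = op_upcase(greeting)
```

Note that `op_upcase` never looks at `buffer` or `encoding` itself. A charset that knows it is sitting on fixed_8 data could skip the encoding calls and work on the bytes directly, which is the "cheating" mentioned above.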

For our purposes, graphemes and code points are all *virtual* -- that is, the values may not be directly represented in the underlying buffer. If the buffer is gzipped the encoding layer will do the decompression as it needs to so it can present code points to the charset layer, and the charset layer synthesizes code points as it needs to if it needs to. Byte access, on the other hand, is always real -- that is, when you ask for byte N from a string you will always get the real byte N, or an exception if this byte isn't accessible.
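Here's a small sketch of the real/virtual difference, using a zlib-compressed buffer. The names (`ZlibFixed8Encoding`, `byte_at`) are made up, and decompressing on every access is just the simplest thing that shows the idea, not a sensible implementation.

```python
import zlib


class ZlibFixed8Encoding:
    """Hypothetical encoding: one byte per code point, stored compressed."""

    def length(self, buf):
        return len(zlib.decompress(buf))

    def code_point_at(self, buf, index):
        # Virtual access: the value returned need not appear in buf at all.
        return zlib.decompress(buf)[index]


def byte_at(buf, n):
    """Real access: byte N of the buffer as stored, or an exception."""
    if not 0 <= n < len(buf):
        raise IndexError("byte %d is not accessible" % n)
    return buf[n]


raw = zlib.compress(b"parrot")
enc = ZlibFixed8Encoding()

first_code_point = enc.code_point_at(raw, 0)  # code point for "p"
first_byte = byte_at(raw, 0)                  # whatever zlib stored first
```

The two calls ask for "item 0" of the same string and get unrelated answers, which is why byte access is only for things like IO.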

This real/virtual access is in for reasons of practicality. Code should *never* be accessing strings by byte. The only reason to access things by byte is if you want the real data in the buffer for something like IO or other low-level things.

When things are done, encodings and charsets will be dynamically loadable -- that is, while parrot will ship with quite a few, only the ones you actually need will be loaded in. This makes for a smaller runtime footprint (so no need to load in ICU if your program is all about Latin-1 data) and for easier upgrading and extension. (We don't have any of the Asian charsets, nor do we have most of the ISO-8859 sets. Yet.)
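One way load-on-demand could work, sketched as a lazy registry. This is an assumption about the general shape, not the planned mechanism: `LazyRegistry` is invented, and the loaders here just build a dict where a real one would pull in a shared library.

```python
class LazyRegistry:
    """Maps names to loader functions; runs a loader only on first use."""

    def __init__(self):
        self._loaders = {}
        self._loaded = {}

    def register(self, name, loader):
        # Registering is cheap: nothing is loaded yet.
        self._loaders[name] = loader

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded(self):
        return sorted(self._loaded)


charsets = LazyRegistry()
charsets.register("ascii", lambda: {"name": "ascii", "max_code_point": 0x7F})
charsets.register("iso-8859-1", lambda: {"name": "iso-8859-1", "max_code_point": 0xFF})

# A Latin-1-only program touches one charset, so only that one gets loaded.
latin1 = charsets.get("iso-8859-1")
```

After this runs, `charsets.loaded()` lists only `iso-8859-1`; the ASCII loader was registered but never called.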

Now, with this in mind it's *very* important to draw a distinction between what is an encoding and what is a charset. This gets somewhat muddled, especially since many of the standards for this stuff define both encoding and charset semantics. This means that we have to be somewhat careful, and it means that we will have charsets and encodings with the same names in some circumstances.

Things which define grapheme semantics are charsets. ASCII is a charset. ISO-8859-x is a charset. Unicode is a charset. Shift-JIS is a charset. EBCDIC is an abomination, but it's also a charset. RAD-50 is a charset. These all define how graphemes behave and what they mean.

Things which define how bytes dance are encodings. UTF-8 is an encoding, UTF-16 is an encoding. Byte is an encoding. (Though I'm calling it fixed_8) Shift-JIS is an encoding. RAD-50 is an encoding.

Now, in some circumstances semantics are mushed together enough that it's somewhat difficult to tease them apart (like in many of the Asian charset/encoding standards) so we'll have some fun there. We'll live, and worst case everyone just pivots to Unicode and pretends not to worry about it.

It is also important to keep in mind that not all charset/encoding pairs are allowable, and that charsets can require certain encodings to be used with them. Unicode, for example, won't allow the RAD-50 or byte encodings, since they don't have sufficient range. ASCII *could* use the UTF-32 encoding if it wanted, though that'd be wasteful. Charsets may have a preferred encoding as well, which is also fine, though we'll prefer they not worry too much about that. (So we can swap in compressing and sparse encodings, for example)
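The range argument can be sketched as a simple comparison. The tables and `pair_allowed` below are illustrative assumptions: the numbers are the largest code point each name can represent or needs as I understand them, and a real check would likely involve more than range.

```python
# Largest code point each encoding can represent (illustrative values).
ENCODING_MAX = {
    "rad50": 39,        # 40-character repertoire, codes 0..39
    "fixed_8": 0xFF,    # one byte per code point
    "utf8": 0x10FFFF,
    "utf32": 0x10FFFF,
}

# Largest code point each charset needs (illustrative values).
CHARSET_MAX = {
    "ascii": 0x7F,
    "iso-8859-1": 0xFF,
    "unicode": 0x10FFFF,
}


def pair_allowed(charset, encoding):
    """A pair works only if the encoding's range covers the charset's."""
    return ENCODING_MAX[encoding] >= CHARSET_MAX[charset]


unicode_in_bytes = pair_allowed("unicode", "fixed_8")  # range too small
ascii_in_utf32 = pair_allowed("ascii", "utf32")        # wasteful, but fine
```

So Unicode over fixed_8 or RAD-50 is rejected, while ASCII over UTF-32 passes, matching the cases above.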

Anyway, with all this, things should work out reasonably well. The bytecode-level API has already been specified, which allows pretty much all of the underlying complexity to be hidden from bytecode programs (and, indeed, allows the existence of non-Unicode data to be hidden if that's what you really want), which is fine.

This should be all checked in and working in the next day or two, at which point I want to merge back into the main tree. We'll lose Unicode support at that point, but putting together a Unicode charset library should be straightforward. We will probably want to take a look at some sort of pmc-class-style preprocessing code, since the charset libraries are all awfully similar, so inheriting's not a bad thing to do. OTOH, I'm not sure we'll have enough of these to matter.

The basic libraries at final merge, if you're following along, will be:

   encoding: fixed_8 (byte == codepoint)
    charset: binary, ascii, ISO-8859-1 (latin-1)

I'd like to get Unicode up to speed quickly at that point, as well as either Shift-JIS or one of the GB sets, though I'm not sure I'll have the time to do so. From there we'll see where we go.

-- Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
