Jeff Clites <[EMAIL PROTECTED]> wrote: I've sent my patch in through RT--it's [perl #28405]!
Phew, that's huge. I'd really like to have smaller patches that do it step by step.
Yes, I know it got quite large--sorry about that, I know it makes things more difficult. Mostly it was all-or-nothing, though--changes to string.c, and then dealing with the consequences of those changes. (It would have been smaller if I had not added the "interpreter" argument to the string API functions which lacked it--updating their call sites is probably half of the patch or so.)
In addition to my responses below, I hope to follow up with a better-organized explanation of the approach I've taken, and the rationale behind it.
But anyway, the patch must have been a lot of work, so let's see and make the best out of it.
Some questions:
First:
- What is string->representation? It seems only to contain string_rep_{one,two,four}.
- How does UTF8 fit here?
- Where are encoding and chartype now?
The key idea is to move to a model in which a string is not represented as a bag of bytes + associated encoding, but rather conceptually as an array of (abstract) characters, which boils down to an array of numbers, each number being the Unicode code point of the corresponding character. The easiest way to do this would be to represent a string using an array of 32-bit ints, but that's a large waste of space in the common case of an all-ASCII string. So what I'm doing is sticking to the idea of modeling a string as conceptually an array of 32-bit numbers, but optimizing by using just a uint8_t[] if all of the characters happen to have numerical value < 2^8, a uint16_t[] if some are outside that range but all are still < 2^16, and finally a uint32_t[] in the rare case that the string contains characters above that range.
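To make that concrete, here's a minimal sketch of the representation choice--illustrative only, not the code in the patch; the struct and function names are placeholders (the string_rep_* names come from the patch, everything else here is mine):

    #include <stdint.h>
    #include <stddef.h>

    /* The three fixed-width representations mentioned above. */
    typedef enum {
        string_rep_one  = 1,   /* uint8_t[]  : every code point <  2^8  */
        string_rep_two  = 2,   /* uint16_t[] : every code point <  2^16 */
        string_rep_four = 4    /* uint32_t[] : anything up to U+10FFFF  */
    } string_rep;

    typedef struct {
        void      *buf;            /* uint8_t[], uint16_t[], or uint32_t[] */
        size_t     strlen;         /* length in characters, not bytes      */
        string_rep representation;
    } sketch_string;

    /* Pick the narrowest representation that can hold every code point. */
    static string_rep
    pick_representation(const uint32_t *codepoints, size_t n)
    {
        string_rep rep = string_rep_one;
        for (size_t i = 0; i < n; i++) {
            if (codepoints[i] >= 0x10000)
                return string_rep_four;
            if (codepoints[i] >= 0x100)
                rep = string_rep_two;
        }
        return rep;
    }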
So internally, strings don't have an associated encoding (or chartype or anything)--if you want to know what the N-th character is, you just jump to index N for the appropriate datatype, and that number is your answer.
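In terms of the sketch above, the N-th-character lookup is just an array index at whatever width the string happens to use (again, a sketch, not the patch's actual code):

    /* O(1) "what's your N-th character" for the hypothetical sketch_string. */
    static uint32_t
    sketch_index(const sketch_string *s, size_t n)
    {
        switch (s->representation) {
            case string_rep_one:  return ((const uint8_t  *)s->buf)[n];
            case string_rep_two:  return ((const uint16_t *)s->buf)[n];
            case string_rep_four: return ((const uint32_t *)s->buf)[n];
        }
        return 0; /* not reached */
    }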
In my view, the fundamental question you can ask a string is "what's your N-th character?", and the fundamental thing you do with the answer is go look up what properties that character has (e.g., sort ordering, case mapping, value as a digit, etc.).
In this model, an "encoding" is a serialization algorithm for strings, and that's it (much like Data::Dumper defines a serialization format for other Perl types). A particular encoding is a particular strategy for serializing a string into a bag of bytes (or vice versa), almost always for interchange via I/O (which is always byte-based). The split into separate "encoding" and "chartype" ends up being undesirable. (I'd argue that such a split is a possible internal design option for a transcoding library, but not part of the conceptual API of a string library. So ICU may have such a split internally, but parrot doesn't need to.) In particular, this metadata is invariably specified as a single parameter--in an XML declaration it's called "encoding", in MIME headers it's called "charset"--and I don't know of any interchange format which actually tries to specify this sort of thing via two separate parameters. Additionally, this split isn't reflected in other string libraries I know of, nor is the concept universal across the actual encoding standards themselves (though in some cases it is used pedagogically).
So I get what the split is trying to capture, but I think it's counterproductive, especially since developers tend to get confused about the whole "encoding thing", and we make matters worse if we try to maintain a type of generality that outruns actual usage.
So in this model (to recap), if I have two files which contain the same Japanese text, one in Shift-JIS and one in UTF-8, then after reading them into parrot strings, the two strings are identical. (They're the same text--right?) You could think of this as "early normalization", but my viewpoint is that the concept of an "encoding" deals exclusively with how to serialize a string for export (and the reverse), and not with the in-memory manipulation (or API) of a string itself. This is very much like defining a format to serialize objects into XML--you don't end up thinking of the objects as "XML-based" themselves.
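As an illustration of the "encoding as deserialization" idea, here's a rough, non-validating UTF-8 decoder which produces the array-of-code-points form. This is a sketch only--error handling is omitted, and the real conversion would go through whatever machinery we settle on--but decoding the same text from Shift-JIS would produce an identical code-point array:

    /* Turn a bag of UTF-8 bytes into code points. Returns the number of
       code points written to out (assumed large enough). */
    static size_t
    utf8_decode_sketch(const uint8_t *bytes, size_t nbytes, uint32_t *out)
    {
        size_t n = 0, i = 0;
        while (i < nbytes) {
            uint8_t  b = bytes[i];
            uint32_t cp;
            size_t   extra;
            if      (b < 0x80) { cp = b;        extra = 0; }
            else if (b < 0xC0) { i++; continue; }  /* stray continuation byte: skip (no validation here) */
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            i++;
            for (size_t j = 0; j < extra && i < nbytes; j++, i++)
                cp = (cp << 6) | (bytes[i] & 0x3F);
            out[n++] = cp;
        }
        return n;
    }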
There are a couple of key strengths of this approach, from a performance perspective (in addition to the conceptual benefit I am claiming):
1) Indexing into a string is O(1). (With the existing model, if you want to find the 1000th character in a string being represented in UTF-8, you need to start at the beginning and scan forward.)
2) There's no memory allocation (and hence no GC) needed during string comparisons or hash lookups. I consider that to be a major win.
3) The Boyer-Moore algorithm can be used for all string searches. (But currently there are a couple of cases I still need to fill in.)
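For the curious, here's roughly what that could look like over the uint8_t representation. This is the Horspool simplification of Boyer-Moore, sketched purely for illustration (not the patch's code); the wider representations would need an adapted bad-character-table strategy, since a 256-entry per-byte table doesn't directly apply to them:

    /* Horspool variant of Boyer-Moore over uint8_t buffers.
       Returns the offset of the first match, or -1 if there is none. */
    static long
    bmh_search_sketch(const uint8_t *hay, size_t hlen,
                      const uint8_t *ndl, size_t nlen)
    {
        size_t shift[256];
        if (nlen == 0 || hlen < nlen)
            return -1;
        for (size_t i = 0; i < 256; i++)
            shift[i] = nlen;
        for (size_t i = 0; i + 1 < nlen; i++)
            shift[ndl[i]] = nlen - 1 - i;
        for (size_t pos = 0; pos + nlen <= hlen;
             pos += shift[hay[pos + nlen - 1]]) {
            size_t j = nlen;
            while (j > 0 && hay[pos + j - 1] == ndl[j - 1])
                j--;
            if (j == 0)
                return (long)pos;
        }
        return -1;
    }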
Incidentally, this overall approach closely matches the one used by ObjC and Java.
- Where is string->language?
I removed it from the string struct because I think that's the wrong place for it (and it wasn't actually being used anywhere yet, fortunately). The problem is that although many operations can be language-dependent (sorting, for example), the operation doesn't depend on the language of the strings involved, but rather on the locale of the reader. The classic example is a list containing a mixture of English and Swedish names; an English reader would want the whole list sorted in English alphabetical order, and a Swedish reader would want to see it in Swedish alphabetical order. (That example is from Richard Gillam's book on Unicode.)
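As a concrete illustration of locale-at-comparison-time (rather than language-on-the-string), here's a small sketch using ICU's C collation API--hypothetical usage, not code from the patch. The same two strings compare differently depending on which reader's locale you hand to the collator; "ö" is expected to sort before "z" for an English reader but after it for a Swedish one:

    #include <unicode/ucol.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollator *en = ucol_open("en", &status);
        UCollator *sv = ucol_open("sv", &status);
        if (U_FAILURE(status))
            return 1;

        UChar o_umlaut[] = { 0x00F6, 0 };   /* "ö" */
        UChar z[]        = { 0x007A, 0 };   /* "z" */

        /* Same strings, different answers: the locale is a property of the
           comparison (the reader), not of the strings themselves. */
        UCollationResult r_en = ucol_strcoll(en, o_umlaut, -1, z, -1); /* expected UCOL_LESS    */
        UCollationResult r_sv = ucol_strcoll(sv, o_umlaut, -1, z, -1); /* expected UCOL_GREATER */

        ucol_close(en);
        ucol_close(sv);
        return (r_en == r_sv);  /* 0: the two orderings differ */
    }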
I'm assuming that was the intention of "language". If it was meant to indicate "perl" v. "python" or something, then I should put it back.
Also, we may want to re-create an API which lets us obtain an integer handle to later use to specify an encoding (right now, encodings are specified by name, as C strings). But the ICU API doesn't use this sort of mechanism, so at the moment we'd just end up turning around and looking up the C string to pass into the ICU API--it isn't a performance enhancer currently.
With this string type how do we deal with anything beyond codepoints?
Hmm, what do you mean? Even prior to this, all of our operations ended up relying on being able to either transcode two arbitrary strings to the same encoding, or ask arbitrary strings what their N-th character is. Ultimately, this carried the requirement that a string be representable as a series of code points. I don't think that there is anything beyond that which a string has to offer. But I'm curious as to what you had in mind.
And some misc remarks/questions:
- string_compare seems just to compare byte, short, or int_32
Yep, exactly.
(There are compare-strings-using-a-particular-normalization-form concepts which we'll ultimately need to handle, much like the simpler concept of case-insensitive comparison. I see these being handled by separate API/ops, and there are a couple of different directions that could take.)
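To make the "just compare numbers" point concrete, here's a sketch of a comparison written in terms of the hypothetical accessor from the earlier sketch--no transcoding, no temporary buffers, and therefore no allocation or GC. Illustrative only, not the patch's actual string_compare:

    /* Compare two strings code point by code point, whatever width each
       one happens to be stored at. */
    static int
    sketch_string_compare(const sketch_string *a, const sketch_string *b)
    {
        size_t len = a->strlen < b->strlen ? a->strlen : b->strlen;
        for (size_t i = 0; i < len; i++) {
            uint32_t ca = sketch_index(a, i);
            uint32_t cb = sketch_index(b, i);
            if (ca != cb)
                return ca < cb ? -1 : 1;
        }
        if (a->strlen == b->strlen)
            return 0;
        return a->strlen < b->strlen ? -1 : 1;
    }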
- What happened to external constant strings?
They should still work (or could). But the only cases in which we can optimize and actually use "in place" a buffer handed to string_make are a handful of encodings. Also, dealing with the "flags" argument passed into string_make may still be incomplete.
- What's the plan towards all the transcode opcodes? (And leaving these as a noop would have been simpler)
Basically there's no need for a transcode op on a string--it no longer makes sense; there's nothing to transcode. What we need to add is an "encode" which creates a bag of bytes from a string + an encoding, and a "decode" which creates a string based on a bag of bytes + an encoding. But we currently lack (as far as I can tell) a data type to hold a naked bag of bytes. It's in my plans to create a PMC for that and, once it exists, to add the corresponding ops.
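The "encode" direction is just the mirror image of decoding: serialize the code points into bytes, which is what the eventual byte-buffer PMC would hold. A UTF-8 sketch, again illustrative only (neither the PMC nor the ops exist yet):

    /* Serialize code points as UTF-8. Returns the number of bytes written
       to out (assumed large enough: up to 4 bytes per code point). */
    static size_t
    utf8_encode_sketch(const uint32_t *cp, size_t n, uint8_t *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t c = cp[i];
            if (c < 0x80) {
                out[k++] = (uint8_t)c;
            }
            else if (c < 0x800) {
                out[k++] = (uint8_t)(0xC0 | (c >> 6));
                out[k++] = (uint8_t)(0x80 | (c & 0x3F));
            }
            else if (c < 0x10000) {
                out[k++] = (uint8_t)(0xE0 | (c >> 12));
                out[k++] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
                out[k++] = (uint8_t)(0x80 | (c & 0x3F));
            }
            else {
                out[k++] = (uint8_t)(0xF0 | (c >> 18));
                out[k++] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
                out[k++] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
                out[k++] = (uint8_t)(0x80 | (c & 0x3F));
            }
        }
        return k;
    }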
- hash_string seems not to deal with mixed encodings anymore.
Yep, since we're hashing based on characters rather than bytes, there's no such thing as mixed encodings. That means that, to use my example from above, a string representing the Japanese word for "sushi" will hash to the same thing no matter what encoding may have been used to represent it on disk.
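For illustration, hashing in this model just walks the code points (FNV-style mixing here, purely as an example--the actual hash function in the patch may differ), using the hypothetical accessor from the earlier sketch:

    /* Hash by character value, so the result is independent of whatever
       on-disk encoding the text arrived in. */
    static uint32_t
    sketch_hash_string(const sketch_string *s)
    {
        uint32_t h = 2166136261u;           /* FNV offset basis */
        for (size_t i = 0; i < s->strlen; i++) {
            h ^= sketch_index(s, i);
            h *= 16777619u;                 /* FNV prime */
        }
        return h;
    }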
- Why does PIO_putps convert to UTF-8?
Mostly for testing. We currently lack a way to associate an encoding with an IO handle (or with an IO read or write), and if you want to write out a string you have to pick an encoding--that's the recipe needed to convert a string into something writable. Until we've developed that, I'm writing everything out in UTF-8, as a sensible default that allows basically any string to be written out. Before this, we were writing out the raw bytes used to internally represent the string, which never really made sense.
- Why does read/PIO_reads generate a UTF-8 string first?
Similar to the above, but right now that API is only being called by the freeze/thaw stuff, and its current odd state reflects what was needed to keep it working (basically, to match the write done by PIO_putps). Ultimately, for the freeze-using-opcodes case we should be accumulating our bytes into a raw byte buffer rather than sneaking them into the body of a string, but since we don't yet have a data type representing a raw byte buffer (as mentioned above), this was a workaround.
Fundamentally, I think we need two sorts of IO API (which may be handled via IO layers or filters): byte-based and string-based. For the string-based case, an encoding always needs to be specified in some manner (either implicitly or explicitly, and either associated with the IO handle or just with a particular read/write).
Of course, pass along any other questions/concerns you have.
Jeff