On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 09:21 , Mark J. Reed wrote:
>> If you're doing arithmetic with the code points or scalar values of
>> characters, then the specific numbers would seem to matter. I'm
>
> I would argue that if you are working with a grapheme cluster
> ("grapheme"), arithmetic on individual grapheme values is undefined.
> What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING
> DOT BELOW]) + 1? If you say it increments the base character (a
> reasonable-looking initial stance), what happens if I add an amount
> which changes the base character to a combining character? And what
> happens if the original grapheme doesn't have a base character?
>
> In short, I think the only remotely sane result of ord() on a grapheme
> is an opaque value meaningful to chr() but to very little, if anything,
> else. If you want to represent it as an integer, fine, but it should be
> obscured such that math isn't possible on it. Conversely, if you want
> ord() values you can manipulate, you must work at the codepoint level.

Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing. If people abuse it, they have only themselves to
blame for relying on what is essentially an implementation detail. The
whole point of ord is to cheat, so if they get caught cheating, well,
they just have to take their lumps.

In the age of Unicode, ord and chr are pretty much irrelevant to most
normal text processing anyway except for encoders and decoders, so
there's not a great deal of point in labeling the integers as an opaque
type, in my opinion.

As an implementation detail, however, it's important to note that the
signed/unsigned distinction gives us a great deal of latitude in how to
store a particular sequence of integers. Latin-1 will (by definition)
fit in a *uint8, while ASCII plus (no more than 128) NFG negatives will
fit into *int8. Most European languages will fit into *int16 with up
to 32768 synthetic chars. Most Asian text still fits into *uint16 as
long as they don't synthesize codepoints. And we can always resort to
*uint32 and *int32 knowing that the Unicode consortium isn't going to
use the top bit any time in the foreseeable future. (Unless, of course,
they endorse something resembling NFG. :)

Note also that uint8 has nothing to do with UTF-8, and uint16 has
nothing to do with UTF-16. Surrogate pairs are represented by a single
integer in NFG. That is, NFG is always abstract codepoints of some sort
without regard to the underlying representation. In that sense it's not
important that synthetic codepoints are negative, of course.

Larry
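
To make the storage point concrete, here is a rough sketch (in Python,
purely for illustration; it is not how any Perl 6 implementation does
it) of picking the narrowest element type for a sequence of NFG
integers. It assumes real codepoints are non-negative and synthetic
graphemes are numbered -1, -2, and so on. The names CANDIDATES and
narrowest_type are made up for this example.

```python
# Candidate element types, narrowest first: (name, lowest value, highest value).
# Assumption for this sketch: synthetic (NFG) graphemes are negative integers,
# real Unicode codepoints are non-negative.
CANDIDATES = [
    ("uint8", 0, 2**8 - 1),
    ("int8", -(2**7), 2**7 - 1),
    ("uint16", 0, 2**16 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("uint32", 0, 2**32 - 1),
    ("int32", -(2**31), 2**31 - 1),
]


def narrowest_type(values):
    """Return the name of the first candidate type that holds every value."""
    lo = min(values, default=0)
    hi = max(values, default=0)
    for name, tmin, tmax in CANDIDATES:
        if tmin <= lo and hi <= tmax:
            return name
    raise ValueError("values do not fit in 32 bits")
```

For instance, a Latin-1 string such as "café" comes out as uint8,
whereas an ASCII string containing one synthetic grapheme (say
[ord("a"), -1]) comes out as int8. The choice depends only on the range
of the integers, not on any UTF encoding.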