Re: "Unicode in 'NFG' formation" ?

Larry Wall Mon, 18 May 2009 10:38:12 -0700

On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 09:21 , Mark J. Reed wrote:
>> If you're doing arithmetic with the code points or scalar values of
>> characters, then the specific numbers would seem to matter.  I'm
>
>
> I would argue that if you are working with a grapheme cluster  
> ("grapheme"), arithmetic on individual grapheme values is undefined.   
> What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING  
> DOT BELOW]) + 1?  If you say it increments the base character (a  
> reasonable-looking initial stance), what happens if I add an amount  
> which changes the base character to a combining character?  And what  
> happens if the original grapheme doesn't have a base character?
>
> In short, I think the only remotely sane result of ord() on a grapheme  
> is an opaque value meaningful to chr() but to very little, if anything, 
> else.  If you want to represent it as an integer, fine, but it should be 
> obscured such that math isn't possible on it.  Conversely, if you want 
> ord() values you can manipulate, you must work at the codepoint level.


Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing.  If people abuse it, they have only themselves
to blame for relying on what is essentially an implementation detail.
The whole point of ord is to cheat, so if they get caught cheating,
well, they just have to take their lumps.  In the age of Unicode,
ord and chr are pretty much irrelevant to most normal text processing
anyway except for encoders and decoders, so there's not a great deal
of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail, however, it's important to note that
the signed/unsigned distinction gives us a great deal of latitude
in how to store a particular sequence of integers.  Latin-1 will (by
definition) fit in a *uint8, while ASCII plus (no more than 128) NFG
negatives will fit into *int8.  Most European languages will fit into
*int16 with up to 32768 synthetic chars.  Most Asian text still fits
into *uint16 as long as they don't synthesize codepoints.  And we can
always resort to *uint32 and *int32 knowing that the Unicode consortium
isn't going to use the top bit any time in the foreseeable future.
(Unless, of course, they endorse something resembling NFG. :)
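
As an illustration only, the width choice described above can be sketched
in Python. This is not taken from any actual implementation: the names
STORAGE_TYPES and narrowest_storage are hypothetical, and the sketch
assumes that real codepoints are stored as non-negative integers and
synthetic (NFG) codepoints as negative ones.

```python
# Hypothetical sketch: pick the narrowest integer type that can hold a
# sequence of NFG values (real codepoints >= 0, synthetic codepoints < 0).

# Candidate types in order of preference, with inclusive value ranges.
STORAGE_TYPES = [
    ("uint8", 0, 2**8 - 1),
    ("int8", -(2**7), 2**7 - 1),
    ("uint16", 0, 2**16 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("uint32", 0, 2**32 - 1),
    ("int32", -(2**31), 2**31 - 1),
]


def narrowest_storage(values):
    """Return the name of the first candidate type that holds every value."""
    values = list(values)
    lo = min(values, default=0)
    hi = max(values, default=0)
    for name, type_min, type_max in STORAGE_TYPES:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in 32 bits")


if __name__ == "__main__":
    print(narrowest_storage([ord(c) for c in "caf\u00e9"]))  # Latin-1 text
    print(narrowest_storage([ord("a"), -1]))  # ASCII plus one synthetic
    print(narrowest_storage([0xE9, -1]))      # Latin-1 plus one synthetic
```

Under these assumptions a single negative value is enough to push a
string from an unsigned type to a signed one, which is why the signed
types hold only half the positive range of their unsigned counterparts.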

Note also that uint8 has nothing to do with UTF-8, and uint16 has
nothing to do with UTF-16.  Surrogate pairs are represented by a single
integer in NFG.  That is, NFG is always abstract codepoints of some
sort without regard to the underlying representation.  In that sense
it's not important that synthetic codepoints are negative, of course.
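
A rough Python sketch of the "one integer per grapheme" idea follows. It
is an assumption-laden illustration, not the Perl 6 design: the class
name NFGTable is hypothetical, and clusters are found by attaching
combining marks to the preceding character, which is a simplification
of real grapheme segmentation. Python strings are already sequences of
abstract codepoints, so a character outside the BMP is one integer here
regardless of how UTF-16 would encode it.

```python
import unicodedata


class NFGTable:
    """Hypothetical table mapping multi-codepoint clusters to negative ids."""

    def __init__(self):
        self._ids = {}       # cluster string -> negative integer
        self._clusters = []  # position (-id - 1) -> cluster string

    def _ord(self, cluster):
        # A cluster that is a single codepoint keeps its real value.
        if len(cluster) == 1:
            return ord(cluster)
        # Otherwise hand out the next unused negative integer.
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text):
        """Return one integer per (simplified) grapheme cluster."""
        text = unicodedata.normalize("NFC", text)
        clusters = []
        for ch in text:
            # Simplification: a combining mark joins the previous cluster.
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return [self._ord(c) for c in clusters]

    def decode(self, values):
        """Inverse of encode, using this table's synthetic ids."""
        return "".join(
            chr(v) if v >= 0 else self._clusters[-v - 1] for v in values
        )


if __name__ == "__main__":
    table = NFGTable()
    print(table.encode("e\u0301"))     # composes to a single real codepoint
    print(table.encode("ax\u0301b"))   # x + acute has no precomposed form
    print(table.encode("\U0001F600"))  # one integer, no surrogate pair
```

The negative ids are only meaningful relative to the table that issued
them, which matches the point that the sign is a storage convenience
rather than part of the abstraction.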

Larry
