Re: "Unicode in 'NFG' formation" ?

Mark J. Reed Mon, 18 May 2009 11:16:34 -0700

> On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
>> I would argue that if you are working with a grapheme cluster
>> ("grapheme"), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

>> In short, I think the only remotely sane result of ord() on a grapheme
>> is an opaque value meaningful to chr() but to very little, if anything,
>> else.

Which is what we have with the negative integer spec.  What I dislike
is the transient, handlish nature of those values: like a handle, you
can't store the value and then use it to reconstruct the grapheme
later.  But since actually storing the grapheme itself should be no
great feat, I guess that's not much of a hardship.
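
To make the "handlish" complaint concrete, here is a rough Python sketch of what I mean. The GraphemeTable class and its methods are hypothetical names invented for illustration, not anything from the spec or an implementation; the only thing it borrows is the idea of negative integers for multi-codepoint graphemes.

```python
class GraphemeTable:
    """Hypothetical per-process registry of multi-codepoint graphemes."""

    def __init__(self):
        self._ids = {}       # grapheme string -> negative id
        self._clusters = []  # index i holds the grapheme with id -(i + 1)

    def ord(self, grapheme: str) -> int:
        # A single code point is its own ordinal.
        if len(grapheme) == 1:
            return ord(grapheme)
        # Anything longer gets the next unused negative id (assumption:
        # the caller passes exactly one grapheme cluster).
        if grapheme not in self._ids:
            self._clusters.append(grapheme)
            self._ids[grapheme] = -len(self._clusters)
        return self._ids[grapheme]

    def chr(self, value: int) -> str:
        if value >= 0:
            return chr(value)
        index = -value - 1
        if index >= len(self._clusters):
            raise KeyError(f"unknown grapheme id {value} in this table")
        return self._clusters[index]


table = GraphemeTable()
cluster = "q\u0307\u0323"       # q + two combining dots, no precomposed form
handle = table.ord(cluster)     # negative, meaningful only to this table
round_trip = table.chr(handle)  # works within the same table
```

The id round-trips through the table that issued it, but a fresh table (think: a later run of the program) has no idea what it refers to, which is why you store the grapheme rather than the number.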

On Mon, May 18, 2009 at 1:37 PM, Larry Wall <la...@wall.org> wrote:
> you can already write complete ord/chr nonsense at the codepoint level
> (even in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the Unicode
> consortium isn't going to use the top bit any time in the foreseeable
> future.

s/top bit/top 11 bits/...
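
The arithmetic behind that substitution, as a quick Python check. The only fact assumed is that the highest Unicode code point is U+10FFFF; the variable names are just for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point in the Unicode codespace

# Number of bits needed to hold the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Bits of a 32-bit integer that no code point can ever set.
unused_top_bits = 32 - bits_needed

print(bits_needed, unused_top_bits)
```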

> Note also that uint8 has nothing to do with UTF-8, and uint16 has
> nothing to do with UTF-16.  Surrogate pairs are represented by a single
> integer in NFG.

They are also represented by a single value in UTF-8; that is, the
full scalar value is encoded directly, rather than being first encoded
into UTF-16 surrogates which are then encoded as UTF-8...
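
A small Python illustration of the difference, using U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary supplementary-plane character. The "surrogatepass" error handler is used here only to show what the surrogate-by-surrogate encoding would look like; it is not valid UTF-8.

```python
ch = "\U0001D11E"  # one scalar value outside the BMP

# UTF-8 encodes the scalar value directly: one 4-byte sequence.
utf8 = ch.encode("utf-8")

# UTF-16 needs a surrogate pair: two 16-bit code units.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")

# Encoding each surrogate separately as if it were a character yields
# two 3-byte sequences, which is NOT what UTF-8 does.
via_surrogates = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(utf8.hex(), hex(high), hex(low), via_surrogates.hex())
```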

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like
something that would be associated with "abstract characters", which
as defined in Unicode are formally distinct from graphemes, which is
what we're talking about here.

Also, the term "code points" includes the surrogates, which can only
appear in UTF-16; I imagine the scalar values we deal with most of the
time at the "character"/grapheme level would be the subset of code
points excluding surrogates, which are called "Unicode scalar values".
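
In code, the distinction is just a range check. A minimal Python sketch; the function names are mine, and the ranges are the standard ones (codespace 0 through 0x10FFFF, surrogates 0xD800 through 0xDFFF).

```python
def is_code_point(value: int) -> bool:
    """True for any value in the Unicode codespace, surrogates included."""
    return 0 <= value <= 0x10FFFF


def is_surrogate(value: int) -> bool:
    """True for the code points reserved for UTF-16 surrogate pairs."""
    return 0xD800 <= value <= 0xDFFF


def is_scalar_value(value: int) -> bool:
    """True for code points that are not surrogates."""
    return is_code_point(value) and not is_surrogate(value)
```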

Surrogates are just weird, since they have assigned code points even
though they're purely an encoding mechanism.  As such, they straddle
the line between abstract characters and an encoding form. I assume
that if text comes in as UTF-16, the surrogates will disappear as far
as character-level P6 code is concerned.  So is there any way for P6
to manipulate surrogates as "characters"?  Maybe an adverb or trait?
Or does one have to descend to the bytewise layer for that?  (As you
said, that *normally* shouldn't be necessary outside encoding and
decoding, where you need to do things bytewise anyway; just trying to
cover all the bases...)
-- 
Mark J. Reed <markjr...@gmail.com>
