> On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
>> I would argue that if you are working with a grapheme cluster
>> ("grapheme"), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

>> In short, I think the only remotely sane result of ord() on a grapheme
>> is an opaque value meaningful to chr() but to very little, if anything,
>> else.

Which is what we have with the negative integer spec. What I dislike is
the transient, handlish nature of those values: as with a handle, you can't
store the value and then use it to reconstruct the grapheme later. But
since actually storing the grapheme itself should be no great feat, I
guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall <la...@wall.org> wrote:
> you can already write complete ord/chr nonsense at the codepoint level (even
> in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the Unicode
> consortium isn't going to use the top bit any time in the foreseeable future.

s/top bit/top 11 bits/...

> Note also that uint8 has nothing to do with UTF-8, and uint16 has
> nothing to do with UTF-16. Surrogate pairs are represented by a single
> integer in NFG.

They are also represented by a single value in UTF-8; that is, the full
scalar value is encoded directly, rather than being first encoded into
UTF-16 surrogates which are then encoded as UTF-8...

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like
something that would be associated with "abstract characters", which as
defined in Unicode are formally distinct from graphemes, which is what
we're talking about here. Also, the term "code points" includes the
surrogates, which can only appear in UTF-16; I imagine the values we deal
with most of the time at the "character"/grapheme level would be "Unicode
scalar values", that is, the subset of code points excluding surrogates.

Surrogates are just weird, since they have assigned code points even
though they're purely an encoding mechanism.
As such, they straddle the line between abstract characters and an
encoding form. I assume that if text comes in as UTF-16, the surrogates
will disappear as far as character-level P6 code is concerned. So is there
any way for P6 to manipulate surrogates as "characters"? Maybe an adverb
or trait? Or does one have to descend to the bytewise layer for that? (As
you said, that *normally* shouldn't be necessary outside encoding and
decoding, where you need to do things bytewise anyway; just trying to
cover all the bases...)

-- 
Mark J. Reed <markjr...@gmail.com>
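
P.S. For anyone who wants to check the numbers above, here is a rough
sketch in Python (not P6, and nothing to do with how NFG would actually be
implemented). The helper names is_surrogate, is_scalar_value and
utf16_units are made up for illustration. It shows that code points need
only 21 bits, that UTF-8 encodes a supplementary scalar value directly in
four bytes, and what you get if you instead encode each UTF-16 surrogate
separately.

```python
# Illustrative sketch only; helper names are hypothetical, not from any spec.
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def is_surrogate(cp):
    """True for the code points reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp):
    """Unicode scalar value: any code point that is not a surrogate."""
    return 0 <= cp <= MAX_CODE_POINT and not is_surrogate(cp)


def utf16_units(cp):
    """Return the UTF-16 code units (one or two) for a scalar value."""
    if not is_scalar_value(cp):
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


# 21 bits cover every code point, leaving 11 spare bits in a uint32.
bits_needed = MAX_CODE_POINT.bit_length()
spare_bits = 32 - bits_needed

clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP

# UTF-8 encodes the scalar value directly: four bytes.
direct_utf8 = chr(clef).encode("utf-8")

# UTF-16 needs a surrogate pair for the same character.
units = utf16_units(clef)

# Encoding each surrogate on its own (the detour UTF-8 does NOT take)
# gives six bytes; Python only allows it with errors="surrogatepass".
via_surrogates = b"".join(
    chr(u).encode("utf-8", "surrogatepass") for u in units
)
```

With the default error handler, chr(0xD834).encode("utf-8") raises
UnicodeEncodeError, which fits the point that surrogates are not scalar
values and have no business in well-formed UTF-8.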