If you haven't read the PDD (Parrot PDD 28), it's a good place to start.

To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.* That is, if composing "a" + "dot above" + "dot below" produces an a with dots above and below it, then THAT is the grapheme.

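2. Unicode has a lot of characters that are single code points representing a complex grapheme. For example, the "A" + "ring above" composition has its own precomposed code point, U+00C5, which looks just like the "Angstrom" symbol (U+212B) and is what the Angstrom symbol itself normalizes to.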

3. But on the other hand, some combinations of basic characters plus combining marks DO NOT have a single code point that represents them. For example, while your girlfriend might compose "dotless lowercase i" with "combining heart above" to produce an i with a heart instead of a dot, there isn't a single code point in Unicode for that. (Unless girly-grrls got their own code page. Maybe in Unicode 6...)

4. Since that's a considerable PITA to deal with, we now have "NFG format", which really should have been called "NFW" format, IMO. (W = widechars, natch.) Every combination of "basic" plus "combining" marks *that gets used* will have a single number allocated. Many of them, like the Angstrom symbol, or "O" + "combining röckdöts", will already have a "real" Unicode code point. The rest of them will get negative numbers assigned, one at a time. The negative numbers will only be meaningful to the string they're in, or maybe only to the particular execution context. (There are issues with comparing strings whose negative numbers come from different tables, etc., which is why I think maybe one table per execution.)

5. The result is that every grapheme (letter-on-the-page) will have a single number behind it, will have a length of 1, etc. So we can do meaningful substr($str, 2, 7) and get what we expect, even when the fifth grapheme requires a base character plus 4 combining marks.
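
Here's a rough sketch of points 2 through 5, in Python rather than Perl 6 just so it runs anywhere. It is NOT how Parrot does it: GraphemeTable and split_clusters are names I made up, the table is one-per-object to stand in for "one table per execution", and the clustering is the naive "base character plus whatever combining marks follow" rule, not the full Unicode grapheme cluster rules.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification (assumption): real grapheme segmentation has more rules
    than "nonzero combining class attaches to the previous character".
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class GraphemeTable:
    """Hypothetical table handing out negative ids for multi-code-point graphemes."""

    def __init__(self):
        self._ids = {}       # cluster string -> negative id
        self._clusters = []  # position (-id - 1) -> cluster string

    def encode(self, text):
        """Return one integer per grapheme: a real code point or a negative id."""
        numbers = []
        # NFC first, so anything with a precomposed code point uses it.
        for cluster in split_clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                numbers.append(ord(cluster))
            else:
                if cluster not in self._ids:
                    # Negative ids are assigned one at a time: -1, -2, ...
                    self._clusters.append(cluster)
                    self._ids[cluster] = -len(self._clusters)
                numbers.append(self._ids[cluster])
        return numbers

    def decode(self, numbers):
        """Turn a list of grapheme numbers back into an ordinary string."""
        return "".join(
            chr(n) if n >= 0 else self._clusters[-n - 1] for n in numbers
        )


table = GraphemeTable()
# "A" + ring above has a precomposed code point; "x" + circumflex does not.
numbers = table.encode("A\u030Ax\u0302b")
print(numbers)       # [197, -1, 98]
print(len(numbers))  # 3 graphemes, although the input had 5 code points
# A substr by grapheme is now a plain slice of the number list.
print(table.decode(numbers[1:3]) == "x\u0302b")  # True
```

Note that the -1 only means "x with circumflex" to this particular table, which is exactly the comparison problem mentioned in point 4.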

All hail @Larry!

=Austin


Mark J. Reed wrote:
Do we really need to be able to map arbitrary graphemes to integers,
or is it enough to have an opaque value returned by ord() that, when
fed to chr(), returns the same grapheme?  If the latter, a list of
code points (in one of the official Normalization Forms) would seem
to be sufficient.
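
A minimal sketch of that alternative, in Python for illustration only: the "opaque value" is just the tuple of the grapheme's code points after NFC normalization. The names grapheme_ord and grapheme_chr are hypothetical and are not the Perl 6 ord() and chr().

```python
import unicodedata


def grapheme_ord(grapheme):
    """Hypothetical ord(): an opaque value made of the NFC code points."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFC", grapheme))


def grapheme_chr(value):
    """Hypothetical chr(): rebuild the grapheme from its opaque value."""
    return "".join(chr(cp) for cp in value)


# A grapheme with no precomposed code point survives the round trip.
value = grapheme_ord("x\u0302")
print(value)                              # (120, 770)
print(grapheme_chr(value) == "x\u0302")   # True

# Equivalent spellings of the same grapheme get the same value.
print(grapheme_ord("A\u030A") == grapheme_ord("\u212B"))  # True
```

The value needs no table and means the same thing in every string, but it is not a single integer, so it gives up the "one number per grapheme" property.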

On 5/18/09, Helmut Wollmersdorfer <hel...@wollmersdorfer.at> wrote:
Darren Duncan wrote:

Since you seem eager, I recommend you start with porting the Parrot PDD
28 to a new Perl 6 Synopsis 15, and continue from there.
IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer


