Re: "Unicode in 'NFG' formation" ?

Austin Hastings Mon, 18 May 2009 06:11:32 -0700

If you haven't read the PDD, it's a good start.

To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.* That is, if composing "a" + "dot above" + "dot below" produces an a with dots above and below it, then THAT is the grapheme.

2. Unicode has a lot of characters that are single code points representing a complex grapheme. For example, the "A" + "ring above" composition shows up as the "Angstrom" symbol.

3. But on the other hand, some combinations of basic characters plus combining marks DO NOT have a single code point that represents them. For example, while your girlfriend might compose "dotless lowercase i" with "combining heart above" to produce an i with a heart instead of a dot, there isn't a single codepoint in Unicode for that. (Unless girly-grrls got their own code page. Maybe in Unicode 6...)

4. Since that's a considerable PITA to deal with, we now have "NFG format", which really should have been called "NFW" format, IMO. (W = widechars, natch.) Every combination of "basic" plus "combining" marks *that gets used* will have a single grapheme allocated. Many of them, like the Angstrom symbol, or "O" + "combining röckdöts", will already have a "real" unicode grapheme. The rest of them will get negative numbers assigned, one at a time. The negative numbers will only be meaningful to the string they're in, or maybe only to the particular execution context. (There are issues with comparing, etc. Which is why I think maybe one table per execution.)

5. The result is that every grapheme (letter-on-the-page) will have a single number behind it, will have a length of 1, etc. So we can do meaningful substr($str, 2, 7) and get what we expect, even when the fifth grapheme requires a base character plus 4 combining marks.
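
For anyone who wants to poke at the idea, here is a rough sketch of points 1 to 5 in Python. It is not how Parrot or the PDD does it. The names GraphemeTable, intern, expand and to_nfg are made up for illustration, and the grouping rule (a base character followed by anything with a nonzero combining class) is a simplification of real grapheme cluster segmentation. It assumes one table per execution, as guessed at in point 4. Note that NFC turns "A" + U+030A into U+00C5, the letter that the Angstrom sign U+212B also normalizes to.

```python
import unicodedata


class GraphemeTable:
    """Hypothetical per-execution table: multi-code-point graphemes get negative ids."""

    def __init__(self):
        self._ids = {}    # tuple of code points -> negative id
        self._seqs = []   # position (-id - 1) -> tuple of code points

    def intern(self, cluster):
        """Return one integer for one grapheme cluster (a short str)."""
        cps = tuple(ord(c) for c in cluster)
        if len(cps) == 1:
            return cps[0]             # already a "real" Unicode code point
        if cps not in self._ids:
            self._seqs.append(cps)    # allocate the next negative number
            self._ids[cps] = -len(self._seqs)
        return self._ids[cps]

    def expand(self, gid):
        """Turn a grapheme number back into its code points as a str."""
        if gid >= 0:
            return chr(gid)
        return "".join(chr(cp) for cp in self._seqs[-gid - 1])


def to_nfg(text, table):
    """NFC-normalize, then attach each combining mark to the preceding base."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch        # simplified rule, see note above
        else:
            clusters.append(ch)
    return [table.intern(c) for c in clusters]


table = GraphemeTable()
# "A" + ring above, then "q" + acute + dot below (no precomposed form), then "bc"
graphemes = to_nfg("A\u030Aq\u0301\u0323bc", table)
print(len(graphemes))     # one entry per letter-on-the-page
print(graphemes[1:3])     # substr-style slice counts graphemes, not code points
```

With a list like this, a slice plays the role of substr($str, 2, 7): every position is one grapheme regardless of how many combining marks sit behind it. The negative numbers mean nothing outside the table that issued them, which is the comparison issue mentioned in point 4.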


All hail @Larry!

=Austin


Mark J. Reed wrote:

Do we really need to be able to map arbitrary graphemes to integers,
or is it enough to have an opaque value returned by ord() that, when
fed to chr(), returns the same grapheme?  If the latter, a list of
code points (in one of the official Normalization Formats) would seem
to be sufficient.
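
A minimal sketch of that alternative, in Python, assuming the opaque value is simply the tuple of NFC code points. The names grapheme_ord and grapheme_chr are hypothetical stand-ins for ord() and chr(), not anything from Perl 6 or Parrot.

```python
import unicodedata


def grapheme_ord(grapheme):
    """Return an opaque, hashable value: the NFC code points of one grapheme."""
    return tuple(ord(c) for c in unicodedata.normalize("NFC", grapheme))


def grapheme_chr(value):
    """Rebuild the grapheme from the value returned by grapheme_ord."""
    return "".join(chr(cp) for cp in value)


key = grapheme_ord("q\u0301\u0323")   # q + acute + dot below
print(grapheme_chr(key) == unicodedata.normalize("NFC", "q\u0301\u0323"))
```

Because the value is built from a normalized form, two spellings of the same grapheme yield the same value without any shared table. The trade-off is that the value is no longer a single integer.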

On 5/18/09, Helmut Wollmersdorfer <hel...@wollmersdorfer.at> wrote:

Darren Duncan wrote:

Since you seem eager, I recommend you start with porting the Parrot PDD
28 to a new Perl 6 Synopsis 15, and continue from there.

IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer
