On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <[email protected]> wrote:
> It's not clear to me that U+1F11 has the problem you describe; perhaps could
> you sketch it out further?
Oops, that should be U+1F11A (PARENTHESIZED LATIN CAPITAL LETTER K). The
full example is:
"\U0001f11aevin" => "(K)evin" => "(k)evin"
I wrote a program to categorize the characters for which the Nickname
"ToLower" rule followed by NFKC is not idempotent (ignoring white space),
grouped by the tag at the start of each character's decomposition. The
counts are the same for Unicode 6.3, 8.0 and 9.0:
{
'<font>': 467,
'<square>': 90,
'<compat>': 35,
'<super>': 27,
'<circle>': 4
}
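
For anyone who wants to reproduce this, here is a rough Python sketch of
that kind of check. It is not the program I ran; it assumes that
str.lower() is a fair stand-in for the profile's "ToLower" rule, it leaves
out white-space handling entirely, and the helper names (nickname_step,
decomposition_tag, tally_non_idempotent) are made up for illustration. The
counts it prints depend on the Unicode version bundled with the Python
build, so they may not match the table above.

```python
import sys
import unicodedata
from collections import Counter


def nickname_step(s):
    # Assumption: str.lower() approximates the profile's "ToLower" rule.
    # Case mapping runs first, then NFKC; white-space rules are omitted.
    return unicodedata.normalize("NFKC", s.lower())


def decomposition_tag(ch):
    # Return the leading "<...>" tag of the character's decomposition
    # mapping, or "(no tag)" for canonical or empty decompositions.
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0]
    return "(no tag)"


def tally_non_idempotent():
    # Count code points whose result changes when the step is applied twice.
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        once = nickname_step(ch)
        if nickname_step(once) != once:
            counts[decomposition_tag(ch)] += 1
    return counts


# The example from this message.
once = nickname_step("\U0001f11aevin")
twice = nickname_step(once)
print(once, "=>", twice)  # (K)evin => (k)evin

counts = tally_non_idempotent()
for tag, n in counts.most_common():
    print(tag, n)
```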
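The following two characters also appear to fail the idempotency test,
even though their decompositions do not begin with '<'. Both decompose
canonically to U+03D2 plus a combining mark, and U+03D2 in turn has a
<compat> decomposition to U+03A5.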
\u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
\u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
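
A small Python sketch of what happens to these two characters, again
assuming str.lower() approximates "ToLower" and that case mapping runs
before NFKC (the chain() helper is only for illustration):

```python
import unicodedata


def nickname_step(s):
    # Assumption: str.lower() stands in for the profile's "ToLower" rule,
    # applied before NFKC.
    return unicodedata.normalize("NFKC", s.lower())


def chain(ch):
    # Apply the step twice and return both intermediate results.
    once = nickname_step(ch)
    twice = nickname_step(once)
    return once, twice


for ch in ("\u03d3", "\u03d4"):
    once, twice = chain(ch)
    print(unicodedata.decomposition(ch))  # canonical mapping, no "<...>" tag
    print(unicodedata.name(ch), "=>",
          unicodedata.name(once), "=>",
          unicodedata.name(twice))
```

The first pass leaves the character alone during case mapping (it has no
lowercase mapping), NFKC then turns it into the ordinary capital letter,
and only the second pass lowercases it.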
> Thanks for your input. Personally I will think about it further and post
> again after I do so.
To me, the problem is to take untrusted input, validate it using
specified rules, and transform it into a stable, unambiguous format.
I'm still learning more about Unicode. Is there a reason that the case
mapping rule has to be applied *before* the normalization rule? The
order appears to make a difference for NFKC. I suppose the Nickname
"comparison" profile could re-apply the case mapping rule after the
normalization rule?
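
To make the question concrete, here is a Python sketch comparing the three
orderings on the examples from this message. It assumes str.lower() as a
stand-in for the case mapping rule; the function names are hypothetical,
and it only checks these few inputs, so it says nothing about whether any
ordering is idempotent in general.

```python
import unicodedata


def case_then_nfkc(s):
    # Case mapping first, then NFKC (the order discussed above).
    return unicodedata.normalize("NFKC", s.lower())


def nfkc_then_case(s):
    # NFKC first, then case mapping.
    return unicodedata.normalize("NFKC", s).lower()


def case_nfkc_case(s):
    # Case mapping, NFKC, then case mapping re-applied.
    return unicodedata.normalize("NFKC", s.lower()).lower()


samples = ["\U0001f11aevin", "\u03d3", "\u03d4"]
for s in samples:
    for f in (case_then_nfkc, nfkc_then_case, case_nfkc_case):
        once = f(s)
        stable = f(once) == once
        print(f.__name__, ascii(s), "=>", ascii(once), "stable:", stable)
```

For these inputs, only the case-then-NFKC order gives a result that changes
when the rule is applied a second time.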
Thanks,
-Bill
_______________________________________________
precis mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/precis