On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <stpe...@stpeter.im> wrote:
> It's not clear to me that U+1F11 has the problem you describe; perhaps could
> you sketch it out further?
Oops, that should be U+0001F11A. The full example is:
"\U0001f11aevin" => "(K)evin" => "(k)evin"
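To make the two steps concrete, here is a minimal Python sketch of that example. It assumes that `str.lower()` followed by NFKC is a reasonable stand-in for the Nickname "ToLower" rules; the function name `nickname_to_lower` is made up for illustration and is not taken from any PRECIS implementation.

```python
import unicodedata


def nickname_to_lower(s: str) -> str:
    """Case-map first, then normalize (the order discussed in this thread)."""
    # Assumption: str.lower() approximates the profile's case mapping rule.
    return unicodedata.normalize("NFKC", s.lower())


once = nickname_to_lower("\U0001f11aevin")   # U+1F11A has no lowercase mapping
twice = nickname_to_lower(once)              # the "K" exposed by NFKC is now lowered

print(once)
print(twice)
```

The first pass leaves U+1F11A untouched during case mapping and only then expands it, so the capital letter survives until the rules are applied a second time.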
I wrote a program to categorize the characters for which Nickname
"ToLower" is not idempotent (ignoring white space). The numbers are the
same for Unicode 6.3, 8.0 and 9.0.
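A scan of this kind could look roughly like the sketch below. It is not the program mentioned above: the grouping by decomposition tag, the helper names, and the use of `str.lower()` for the case mapping rule are all assumptions, and the result depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`).

```python
import sys
import unicodedata
from collections import defaultdict


def to_lower_nfkc(s: str) -> str:
    # Assumption: str.lower() approximates the profile's case mapping rule.
    return unicodedata.normalize("NFKC", s.lower())


def non_idempotent_by_tag() -> dict:
    """Group code points where applying the rules twice differs from once."""
    groups = defaultdict(list)
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:      # skip surrogates
            continue
        ch = chr(cp)
        if ch.isspace():                # white space is ignored, as in the text
            continue
        once = to_lower_nfkc(ch)
        if to_lower_nfkc(once) != once:
            decomp = unicodedata.decomposition(ch)
            if decomp.startswith("<"):
                tag = decomp.split()[0]  # e.g. "<compat>"
            elif decomp:
                tag = "canonical"
            else:
                tag = "none"
            groups[tag].append(cp)
    return groups


groups = non_idempotent_by_tag()
print("Unicode", unicodedata.unidata_version)
for tag, cps in sorted(groups.items()):
    print(tag, len(cps))
```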
The following two characters also appear to fail the idempotence test.
Their initial decomposition mappings do not begin with '<', i.e. they
are canonical rather than compatibility decompositions.
\u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
\u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
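The behaviour of these two characters can be checked with Python's `unicodedata` module. This is only an illustrative sketch, again assuming `str.lower()` plus NFKC as an approximation of the Nickname rules; the helper `code_points` is hypothetical.

```python
import unicodedata


def code_points(s: str) -> str:
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for ch in ("\u03d3", "\u03d4"):
    # The decomposition field is canonical here (no leading "<tag>").
    decomp = unicodedata.decomposition(ch)
    once = unicodedata.normalize("NFKC", ch.lower())
    twice = unicodedata.normalize("NFKC", once.lower())
    print(code_points(ch), repr(decomp), "->", code_points(once), "->", code_points(twice))
```

The canonical step yields U+03D2 plus a combining mark, and the compatibility mapping of U+03D2 to U+03A5 is only reached during NFKC, after case mapping has already run.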
> Thanks for your input. Personally I will think about it further and post
> again after I do so.
To me, the problem is to take untrusted input, validate it using
specified rules, and transform it into a stable, unambiguous format.
I'm still learning more about Unicode. Is there a reason that the case
mapping rule has to be applied *before* the normalization rule? The
order appears to make a difference for NFKC. I suppose the Nickname
"comparison" profile could re-apply the case mapping rule after the
normalization rule.
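To illustrate the ordering question, the sketch below compares the two orders on the inputs mentioned in this thread. It assumes `str.lower()` approximates the case mapping rule, and the function names are invented for the comparison. It only shows what happens for these inputs; it is not evidence that normalizing first is stable for every string.

```python
import unicodedata


def lower_then_nfkc(s: str) -> str:
    # Case mapping before normalization.
    return unicodedata.normalize("NFKC", s.lower())


def nfkc_then_lower(s: str) -> str:
    # Normalization before case mapping (the alternative being asked about).
    return unicodedata.normalize("NFKC", s).lower()


for sample in ("\U0001f11aevin", "\u03d3", "\u03d4"):
    a = lower_then_nfkc(sample)
    b = nfkc_then_lower(sample)
    print(repr(a), "stable:", lower_then_nfkc(a) == a)
    print(repr(b), "stable:", nfkc_then_lower(b) == b)
```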