On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <stpe...@stpeter.im> wrote:
> It's not clear to me that U+1F11 has the problem you describe; perhaps could 
> you sketch it out further?

Oops, that should be U+0001F11A. The full example is:
"\U0001f11aevin" => "(K)evin" => "(k)evin"

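The example above can be reproduced with a minimal Python sketch. It assumes that str.lower() followed by NFKC is a reasonable stand-in for one pass of the Nickname rules; the helper name nickname_pass is hypothetical, and the profile's white space handling is left out.

```python
import unicodedata

def nickname_pass(s):
    # Assumption: str.lower() approximates the profile's "ToLower" case mapping.
    lowered = s.lower()
    # The normalization rule (NFKC) is applied after the case mapping.
    return unicodedata.normalize("NFKC", lowered)

first = nickname_pass("\U0001f11aevin")   # U+1F11A has no lowercase mapping
second = nickname_pass(first)             # the exposed "K" is lowercased now

print(first)
print(second)
```

The second pass changes the string again, which is what "not idempotent" means here.
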
I wrote a program to categorize, by decomposition tag, the characters
for which Nickname "ToLower" is not idempotent (ignoring white space).
The counts are the same for Unicode 6.3, 8.0 and 9.0.

{
  '<font>': 467,
  '<square>': 90,
  '<compat>': 35,
  '<super>': 27,
  '<circle>': 4
}
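
A scan of this kind could be sketched as follows. This is not the original program: it assumes str.lower() plus NFKC as one pass, uses str.isspace() to skip white space, and the "(no tag)" label and the function names are made up for illustration. The counts it prints depend on the Unicode version bundled with the Python interpreter, so they may differ from the numbers above.

```python
import sys
import unicodedata
from collections import Counter

def nickname_pass(s):
    # Assumption: lowercase first, then NFKC, as one pass of the rules.
    return unicodedata.normalize("NFKC", s.lower())

def failure_tag(ch):
    """Return the decomposition tag if a second pass changes the result, else None."""
    once = nickname_pass(ch)
    if nickname_pass(once) == once:
        return None
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split()[0]      # e.g. '<font>' or '<compat>'
    return "(no tag)"                 # hypothetical label for untagged decompositions

def scan():
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:    # skip surrogate code points
            continue
        ch = chr(cp)
        if ch.isspace():              # "ignoring white space"
            continue
        tag = failure_tag(ch)
        if tag is not None:
            counts[tag] += 1
    return counts

counts = scan()
print(unicodedata.unidata_version, dict(counts))
```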

The following two characters also appear to fail the idempotence test,
although their initial decompositions do not begin with '<'.

\u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
\u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
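
A short Python sketch of why these two fail, again assuming str.lower() plus NFKC as one pass (the helper name trace is hypothetical). Their own decompositions are canonical, but the first element, U+03D2, carries a <compat> mapping to the capital upsilon, so NFKC produces an uppercase letter that the next pass lowercases.

```python
import unicodedata

def trace(ch):
    """Return (decomposition, result of first pass, result of second pass)."""
    first = unicodedata.normalize("NFKC", ch.lower())
    second = unicodedata.normalize("NFKC", first.lower())
    return unicodedata.decomposition(ch), first, second

for ch in ("\u03d3", "\u03d4"):
    decomp, first, second = trace(ch)
    print(f"U+{ord(ch):04X} [{decomp}] -> U+{ord(first):04X} -> U+{ord(second):04X}")

# The compatibility mapping sits one level down, on U+03D2.
print(unicodedata.decomposition("\u03d2"))
```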


> Thanks for your input. Personally I will think about it further and post 
> again after I do so.

To me, the problem is to take untrusted input, validate it using
specified rules, and transform it into a stable, unambiguous format.
I'm still learning more about Unicode. Is there a reason that the case
mapping rule has to be applied *before* the normalization rule?  The
order appears to make a difference for NFKC.  I suppose the Nickname
"comparison" profile could re-apply the case mapping rule after the
normalization rule?
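
The effect of the ordering can be seen on the example string with a small Python sketch. The three function names are hypothetical, and str.lower() is assumed to stand in for the case mapping rule. The sketch only compares the orderings on one input; it does not show that any of them is idempotent for all input.

```python
import unicodedata

def lower_then_nfkc(s):
    # Current order: case mapping rule, then normalization rule.
    return unicodedata.normalize("NFKC", s.lower())

def nfkc_then_lower(s):
    # Swapped order: normalization rule, then case mapping rule.
    return unicodedata.normalize("NFKC", s).lower()

def lower_nfkc_lower(s):
    # Current order with the case mapping re-applied after normalization.
    return lower_then_nfkc(s).lower()

sample = "\U0001f11aevin"
for rule in (lower_then_nfkc, nfkc_then_lower, lower_nfkc_lower):
    print(rule.__name__, rule(sample))
```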

Thanks,
-Bill

_______________________________________________
precis mailing list
precis@ietf.org
https://www.ietf.org/mailman/listinfo/precis
