Hi Bill, thanks for your message and sorry about the seriously delayed reply - I've been working to finish some other Internet-Drafts and now have time again to finish the PRECIS updates.

On 10/13/16 1:33 PM, William Fisher wrote:
On Wed, Oct 12, 2016 at 9:03 PM, Peter Saint-Andre <stpe...@stpeter.im> wrote:
It's not clear to me that U+1F11 has the problem you describe; perhaps you 
could sketch it out further?

Oops, that should be U+0001F11A.

Did you mean U+212A (KELVIN SIGN)? That decomposes to U+004B (LATIN CAPITAL LETTER K).

The full example is:
"\U0001f11aevin" => "(K)evin" => "(k)evin"

Yes, "U+212Aevin" => "Kevin" via NFKC.

However, "U+212Aevin" => "kevin" via toLower() if I am not mistaken.
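
The two cases above can be checked with a small Python sketch. It assumes that Python's str.lower() is a reasonable stand-in for toLower() and that unicodedata.normalize("NFKC", ...) stands in for the normalization rule; the helper name lower_then_nfkc is made up for illustration and is not taken from any PRECIS implementation.

```python
import unicodedata


def lower_then_nfkc(s: str) -> str:
    """Case-map first, then normalize (assumed stand-in for the order under discussion)."""
    return unicodedata.normalize("NFKC", s.lower())


kelvin = "\u212aevin"        # KELVIN SIGN followed by "evin"
paren_k = "\U0001f11aevin"   # PARENTHESIZED LATIN CAPITAL LETTER K followed by "evin"

# KELVIN SIGN has a direct lowercase mapping, so one pass is already stable.
kelvin_once = lower_then_nfkc(kelvin)
kelvin_twice = lower_then_nfkc(kelvin_once)

# U+1F11A has no lowercase mapping; the capital K only appears after NFKC,
# so a second pass changes the string again.
paren_once = lower_then_nfkc(paren_k)
paren_twice = lower_then_nfkc(paren_once)

print(repr(kelvin_once), repr(kelvin_twice))
print(repr(paren_once), repr(paren_twice))
```

If the first and second pass differ, the operation is not idempotent for that input, which is the property being discussed here.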

I wrote a program to categorize characters that are not idempotent
under Nickname "ToLower" (ignoring white space). The numbers are the
same for Unicode 6.3, 8.0 and 9.0.

{
  '<font>': 467,
  '<square>': 90,
  '<compat>': 35,
  '<super>': 27,
  '<circle>': 4
}
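
A survey of this kind could be sketched roughly as follows. This is not the program mentioned above, only a guess at its general shape: it assumes that the operation under test is "lowercase, then NFKC", that str.lower() approximates toLower(), and that characters are grouped by the tag at the start of their decomposition mapping. The names nickname_step and non_idempotent_by_tag, and the "(canonical)" label for untagged decompositions, are hypothetical. The counts depend on the Unicode version bundled with the Python interpreter, so none are quoted here.

```python
import sys
import unicodedata
from collections import Counter


def nickname_step(s: str) -> str:
    """Assumed stand-in for one pass: case mapping, then NFKC."""
    return unicodedata.normalize("NFKC", s.lower())


def non_idempotent_by_tag() -> Counter:
    """Count code points where a second pass changes the result of the first."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # Skip surrogates, unassigned and private-use code points, and white space.
        if unicodedata.category(ch) in ("Cs", "Cn", "Co") or ch.isspace():
            continue
        once = nickname_step(ch)
        if nickname_step(once) == once:
            continue
        decomp = unicodedata.decomposition(ch)
        # Compatibility mappings start with a tag such as "<font>";
        # canonical mappings have no tag at all.
        tag = decomp.split()[0] if decomp.startswith("<") else "(canonical)"
        counts[tag] += 1
    return counts


counts = non_idempotent_by_tag()
for tag, n in counts.most_common():
    print(tag, n)
```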

Would you mind sending me your list of characters? (I'm happy to receive it off-list.) I suspect that it might be similar to a list that emerged from differing assumptions regarding how to apply the PRECIS rules in implementations. My original implementation for testing purposes was rather naïve, whereas the implementation that Yoshiro Yoneya and Takahiro Nemoto created was smarter, in the sense that it would follow the chain of characters and decompose each one fully as it went along (this might require a few rounds of applying the normalization rule in order to fully decompose the original characters).

The following two characters also appear to fail the idempotent test.
The initial decompositions do not begin with '<'.

\u03d3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
\u03d4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL

These examples are different from the KELVIN SIGN example, because there is no direct toLower() transformation - normalization needs to happen before toLower() is applied.

Thanks for your input. Personally I will think about it further and post again 
after I do so.

To me, the problem is to take untrusted input, validate it using
specified rules, and transform it into a stable, unambiguous format.
I'm still learning more about Unicode.

Trust me: it never ends.

Is there a reason that the case
mapping rule has to be applied *before* the normalization rule?

As explained in Section 5.2.1 of RFC 7564, there is a good reason to apply the width mapping rule before the normalization rule. I'm now less sure that it makes sense, for comparison purposes, to apply the case mapping rule before the normalization rule.

The
order appears to make a difference for NFKC.  I suppose the Nickname
"comparison" profile could re-apply the case mapping rule after the
normalization rule?

If I understand correctly, you are suggesting that an implementation that is processing nickname strings for purposes of comparison would do the following:

1. Apply the "enforcement" action in Section 2.3
2. Apply the "comparison" action in Section 2.4

Let's choose a practical but somewhat contrived example: a nickname of ΨϓΧΗ, which is U+03A8 U+03D3 U+03A7 U+0397 (something like an uppercase version of the Greek word for soul, although the accent is wrong). This includes the code point U+03D3 that you mention above (which, by the way, is not the standard code point for the Greek letter upsilon but an alternative with a hook symbol, the usual character being U+03A5).

The two-step process you suggest would involve the following:

1. The "enforcement" action results in normalization (note that full normalization involves several steps):

U+03A8 U+03D3 U+03A7 U+0397
=>
U+03A8 U+03D2 U+0301 U+03A7 U+0397
=>
U+03A8 U+03A5 U+0301 U+03A7 U+0397

(note that U+03D2 has a compatibility equivalent of U+03A5)

2. The "comparison" action results in case mapping:

U+03A8 U+03A5 U+0301 U+03A7 U+0397
=>
U+03C8 U+03C5 U+0301 U+03C7 U+03B7

Thus, for comparison purposes, ΨϓΧΗ and ψύχη would be considered equivalent.

Unfortunately, even though that seems to yield the correct outcome, it's not what RFC 7700 specifies.
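
The effect of the two orders on this nickname can be compared with a short Python sketch. It assumes that str.lower() is an acceptable stand-in for toLower() and that unicodedata.normalize("NFKC", ...) stands in for the normalization rule; the function names are made up for illustration. One difference from the steps written out above: Python returns the composed NFKC form, so U+03A5 U+0301 shows up as U+038E, and after lowercasing as U+03CD.

```python
import unicodedata


def nfkc_then_lower(s: str) -> str:
    """Normalize first, then case-map (the two-step process sketched above)."""
    return unicodedata.normalize("NFKC", s).lower()


def lower_then_nfkc(s: str) -> str:
    """Case-map first, then normalize (assumed reading of the current order)."""
    return unicodedata.normalize("NFKC", s.lower())


def code_points(s: str) -> str:
    """Render a string as U+XXXX notation for easier comparison."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


nickname = "\u03a8\u03d3\u03a7\u0397"   # the contrived uppercase nickname
expected = "\u03c8\u03cd\u03c7\u03b7"   # lowercase word in composed form

normalized_first = nfkc_then_lower(nickname)
case_mapped_first = lower_then_nfkc(nickname)

print(code_points(normalized_first))
print(code_points(case_mapped_first))
print(normalized_first == expected, case_mapped_first == expected)
```

With case mapping applied first, U+03D3 passes through toLower() untouched and only becomes a capital upsilon during normalization, so the two orders produce different strings for this input.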

I'll continue to think about this - in particular, about any negative implications from modifying the order of operations so that normalization comes before case mapping (unlike what we specified in RFC 7700). Because we would prefer that all the PRECIS specs follow the same order, we'd also need to look at the implications for RFC 7613 (although the OpaqueString profile that we use there for passwords has quite a different purpose than the Nickname profile in RFC 7700); however, keeping the same order across all of the specs might not be possible.

Peter

_______________________________________________
precis mailing list
precis@ietf.org
https://www.ietf.org/mailman/listinfo/precis