At 08:25 02/07/10 +0200, Dan Oscarsson wrote:

>I do not know why stringprep only has NFKC or unnormalised as its
>possible choices. NFC is a very suitable choice to use for UCS.
>It is the choice of W3C

Yes, the most important point here being that W3C deals with all
kinds of text rather than just identifiers.


>and is the required choice in IRI/URIs.

Please read the newest draft, at
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
NFC is indeed the default choice. Some normalization is needed
because otherwise you don't know which UCS codepoints you get for,
e.g., an 'a' with two dots above. But NFC is not applied everywhere:
if an IRI arrives already encoded in Unicode, it is not normalized
again, and there is even the option that a user enters something
unnormalized on purpose.
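To make the 'a'-with-two-dots case concrete, here is a small sketch
using Python's standard unicodedata module: the same visible character
can arrive as two different codepoint sequences, and NFC maps both to
one canonical form.

```python
import unicodedata

# An 'a' with two dots above can arrive as two different codepoint sequences:
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00E4"   # LATIN SMALL LETTER A WITH DIAERESIS

# Without normalization, the two are distinct strings...
assert decomposed != precomposed

# ...but NFC maps both to the single precomposed codepoint.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```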

The main reason for this is that URIs/IRIs are 'greatest common
denominators' for a lot of other identifiers. It's rather clear
that we don't need a circled 'A' in IDN. But we don't want to
eliminate circled 'A's from all other identifiers. For non-normalized
text, we don't want to disallow the following:

http://example.com/normalize.cgi?input=<something-non-normalized>
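The circled 'A' illustrates the difference between the two forms: NFKC
folds such compatibility characters away, while NFC leaves them intact.
A quick check with Python's standard unicodedata module (a sketch, not
tied to any particular stringprep profile):

```python
import unicodedata

circled_a = "\u24B6"  # CIRCLED LATIN CAPITAL LETTER A

# NFC preserves the compatibility character unchanged...
assert unicodedata.normalize("NFC", circled_a) == circled_a

# ...whereas NFKC folds it to a plain 'A', losing the distinction.
assert unicodedata.normalize("NFKC", circled_a) == "A"
```

This is why NFKC may be right for matching identifiers like IDN labels,
but too aggressive for URIs/IRIs that must carry arbitrary text.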

Regards,     Martin.


>NFC has the nice properties that it preserves all information and is
>compact. It is a very good choice to use for interoperability.
>Unnormalised is only useful locally on a system, never for
>interoperability.
>NFKC is a possible choice when matching text but may go too far when
>applied to identifiers (names).
>
>    Dan
