At 08:25 02/07/10 +0200, Dan Oscarsson wrote:
>I do not know why stringprep only has NFKC or unnormalised as
>possible choices. NFC is a very suitable choice to use for UCS.
>It is the choice of W3C

Yes, the most important point here being that W3C deals with all
kinds of text rather than just identifiers.

>and is the required choice in IRIs/URIs.

Please read the newest draft, at
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
NFC is indeed the default choice. Something like this is needed
because otherwise, you don't know which UCS codepoints you get for
e.g. an 'a' with two dots above.
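(To make this concrete, here is a tiny sketch using Python's
standard unicodedata module; this is just one convenient way to
observe the mapping, not anything specific to stringprep or the
IRI draft:)

  import unicodedata

  # 'a' + U+0308 COMBINING DIAERESIS: two codepoints for one letter
  decomposed = "a\u0308"
  composed = unicodedata.normalize("NFC", decomposed)

  print([hex(ord(c)) for c in decomposed])  # ['0x61', '0x308']
  print([hex(ord(c)) for c in composed])    # ['0xe4'] (precomposed U+00E4)

Without an agreed normalization form, both sequences are equally
valid ways to write the same letter.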
But NFC is not applied everywhere; if you get an IRI already encoded
in Unicode, it is not normalized again, and there is even the option
that a user enters something unnormalized on purpose. The main reason
for this is that URIs/IRIs are 'greatest common denominators' for a
lot of other identifiers. It's rather clear that we don't need a
circled 'A' in IDN, but we don't want to eliminate circled 'A's from
all other identifiers.
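(Again a small Python sketch, purely illustrative, showing that NFC
preserves the circled 'A' while NFKC folds it away:)

  import unicodedata

  circled_a = "\u24b6"  # U+24B6 CIRCLED LATIN CAPITAL LETTER A

  # NFC keeps the character; NFKC applies the compatibility mapping
  print(unicodedata.normalize("NFC", circled_a))   # unchanged, U+24B6
  print(unicodedata.normalize("NFKC", circled_a))  # plain 'A', U+0041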
For non-normalized text, we don't want to disallow the following:

http://example.com/normalize.cgi?input=<something-non-normalized>

Regards,    Martin.

>NFC has the nice properties that it preserves all information and
>is compact. It is a very good choice to use for interoperability.
>Unnormalised is only useful locally on a system, never for
>interoperability.
>NFKC is a possible choice when matching text but may go too far
>when related to identifiers (names).
>
>  Dan