Guido van Rossum writes:

 > If you want to have an abstraction that guarantees you'll never see
 > an unnormalized text string you should design a library for doing so.
OK.

 > (*) It looks like such a library will not have a way to talk about
 > "\u0308" at all, since it is considered unnormalized.

From the Unicode Standard, v4.0, p. 43: "In the Unicode Standard, all
sequences of character codes are permitted."  Since normalization only
applies to characters with decompositions, "\u0308" is indeed valid
Unicode, a one-character sequence in NFC.

AFAIK, the only strings the Unicode standard absolutely prohibits
emitting are those containing code points guaranteed not to be
characters by the standard.  And normalization is simply an internal
technique that allows text operations to be implemented
code-point-wise without fear that emitting the results would produce
illegal sequences or other externally visible incompatibilities with
the standard.

So there's nothing "wrong by definition" about defining strings as
sequences of code points, and string operations in code-point-wise
fashion.  It just makes that library for Unicode more expensive to
design and operate, and it will require auditing and reimplementation
of common libraries (including the standard library) by every program
that requires strict Unicode conformance.

_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
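To make the "\u0308" point concrete, here is a minimal sketch using the
standard unicodedata module.  The helper name is_nfc and the choice of
"u" as the base character are mine, purely for illustration; nothing
here is meant as a proposed API.  It shows that a lone combining
diaeresis is already in NFC, and that code-point-wise concatenation of
two NFC strings can give a result that is not.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


diaeresis = "\u0308"   # COMBINING DIAERESIS, standing alone
base = "u"             # arbitrary base character chosen for the example

# Each piece, taken by itself, is already in NFC.
assert is_nfc(diaeresis)
assert is_nfc(base)

# Code-point-wise concatenation yields a string that is not in NFC,
# because NFC composes the pair into the single code point U+00FC.
joined = base + diaeresis
assert not is_nfc(joined)
assert unicodedata.normalize("NFC", joined) == "\u00fc"

# Going the other way, code-point-wise indexing of decomposed text
# hands back the lone combining mark.
decomposed = unicodedata.normalize("NFD", "\u00fc")
assert len(decomposed) == 2
assert decomposed[1] == diaeresis
```

So a library that wants to guarantee normalized output has to
renormalize (or at least check) at operations like concatenation, which
is the extra cost referred to above.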
