On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <gu...@python.org> wrote:
> Now I am happy to admit that for many Unicode issues the level at
> which we have currently defined things (code units, I think -- the
> thingies that encodings are made of) is confusing, and it would be
> better to switch to the others (code points, I think). But characters
> are right out.

Indeed, code points are the abstract concept and code units are the
fixed-size units that a specific encoding uses to serialise them (FWIW,
I'm going to try to keep this straight in the future by remembering that
the Unicode character set is defined as abstract points on planes, just
like geometry).

With narrow builds, code units can currently come into play internally
(code points outside the BMP are stored as surrogate pairs), but with
PEP 393 everything internal will be working directly with code points.

Normalisation, combining characters and bidi issues may still affect
the correctness of Unicode comparison and slicing (and other text
manipulation), but there are limits to how much of the underlying
complexity we can effectively hide without being misleading.

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
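
P.S. To make the code point / code unit distinction concrete, here is a
minimal sketch, assuming Python 3.3 or later (where PEP 393 is
implemented). The helper name code_unit_count is made up for
illustration and is not part of any library. It also shows one of the
remaining normalisation pitfalls: two strings that render the same can
still compare unequal.

```python
import unicodedata


def code_unit_count(text, encoding, unit_size):
    """Count the code units `text` occupies in an encoding.

    Hypothetical helper: `unit_size` is the width of one code unit in
    bytes (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_size


# A single code point outside the BMP (U+1F40D).
snake = "\U0001F40D"

points = len(snake)                            # code points
utf16 = code_unit_count(snake, "utf-16-le", 2)  # 16-bit code units
utf8 = code_unit_count(snake, "utf-8", 1)       # 8-bit code units

print(points, utf16, utf8)  # 1 2 4

# Working in code points does not make normalisation go away:
composed = "\u00e9"      # e with acute, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

On a narrow build of an older Python, len() on the first string would
have reported the two UTF-16 code units rather than the one code point,
which is exactly the leak that PEP 393 removes.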