On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Michael Torrie <torr...@gmail.com>:
>
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
>
> I don't really like it how Unicode is equated with text, or even
> character strings.
>
> There's barely any difference between the truth value of these
> statements:
>
>    Python strings are ASCII.
>
>    Python strings are Latin-1.
>
>    Python strings are Unicode.
>
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

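To see where each of those three statements stops holding, here is a
minimal sketch. The sample strings, the `samples` dict and the `fits`
helper are hypothetical names picked for illustration, not anything
from the thread; the only real API used is str.encode / bytes.decode.

```python
# Hypothetical sample strings, chosen only to sit just inside or just
# outside each character set.
samples = {
    "ascii": "hello",
    "latin1": "caf\u00e9",        # U+00E9 is in Latin-1 but not in ASCII
    "bmp": "\u65e5\u672c\u8a9e",  # CJK text, outside Latin-1
    "astral": "\U0001F600",       # above U+FFFF, outside the BMP
}


def fits(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# For each sample, record which encodings can represent it.
results = {
    name: {enc: fits(text, enc) for enc in ("ascii", "latin-1", "utf-8")}
    for name, text in samples.items()
}

for name, row in results.items():
    print(name, row)

# Encoding goes str -> bytes; decoding goes bytes -> str.
round_trip = samples["astral"].encode("utf-8").decode("utf-8")
```

All four samples are ordinary Python 3 str objects. What varies is
which encodings can turn them into bytes: only the first survives
ASCII, only the first two survive Latin-1, and all four survive UTF-8.
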
The difference is that ASCII and Latin-1 cut out a large number of
active world languages, UCS-2 (the intermediate option you didn't
mention) cuts out a small proportion (by usage) of significant
characters, and Unicode cuts out only those characters which fall under
issues like Han unification. (Plus any that haven't yet been allocated.
But since Python doesn't actually validate code points to ensure that
they've been given meanings, you can use today's Python to work with
tomorrow's Unicode.)

Do you have actual text that you're unable to represent in Unicode? If
so, you are going to have major problems using it with *any* computer
system. There are Japanese encodings that can represent additional
characters, but they also *cannot* represent a lot of the other
characters we use, so there'll be fundamental incompatibilities.

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.
>
> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We don't call numbers IEEE, but if we're working with Python floats, we
*do* require all numbers to be representable as IEEE floating-point.
Don't like that? Pick decimal.Decimal instead, or fractions.Fraction,
and pick a different set of limitations... but ultimately, you *will*
have restrictions - and much tighter restrictions than Unicode places
on text.

Do you genuinely have text that you can't represent in Unicode, or are
you just arguing against Unicode to try to justify "Python strings are
<something else>" as a basis for your code?

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list