Guido van Rossum writes:

 > No it cannot. We are talking about \u escapes, not about a string
 > literal containing Unicode characters ("Löwis").
Ah, good point. I apologize for mistyping the example. *I* *was*
talking about a string literal containing Unicode characters. However,
on my terminal you can't see the difference! So I (ab)used the \u
escapes to make clear that in one case the representation used 5 code
points and in the other 6. (The first of the small sketches appended
at the end of this message shows the two spellings.)

 > > > I might be writing either literal with the expectation to get
 > > > exactly that sequence of code points,

This should be possible, agreed. Couldn't raw string read syntax be
given the right semantics? And of course you've always got tuples of
integers.

What bothers me about the "sequence of code points" way of thinking is
that len("Löwis") is nondeterministic: two visually identical literals
can have different lengths. To my mind, especially from the
educational standpoint, but also from the point of view of
implementing a text editor or docutils, that's much more horrible than
Martin's point that len(a) + len(b) == len(a+b) could fail if we do
NFC normalization. (NFD would work here; see the second sketch below.)

I'm not sure what happened, but after recent upgrades to Python and
docutils (presumably the latter) a bunch of Japanese reST documents of
mine broke. I have no idea how to count the number of characters in a
line containing Japanese any more (even having fixed the tables by
trial and error, it's not obvious), but of course tables require being
able to do that exactly. Normalization would guarantee TOOWTDI.

But IMO the right way to do normalization in such cases is in Python
itself. One is *never* going to be able to keep up with all the
external libraries, and it seems very unlikely that many will be high
quality from this point of view. So even if your own code does the
right thing, you have to wrap every external module you call. Or you
can rewrite Python to normalize in the right places once, and then you
don't have to worry about it. (Bugs, yes, but then you fix them in the
forked Python, and all your code benefits from the fix automatically.)

 > Bytes are not code points. The unicode string type has always been
 > about code points, not characters.

I wish you had named it "widechar", then. I think that a language
where len("Löwis") == len("Löwis") is an invariant is one honking good
idea!

 > Have you ever even used the unicode string type in Python 2?

Yes. On the Mac, I often have to run unicodes through NFD
normalization because some levels of Mac OS X normalize to NFD and
others don't normalize at all. That means that file names in
particular tend to be different depending on whether I get them from
the OS or from the user. But a test as simple as creating a file with
a name containing \u010D and trying to stat it can fail, AIUI because
stdio normalizes to NFD but the raw OS stat call doesn't. This
particular test does work in Python; I'm not sure what the difference
is. (The third sketch below shows roughly how I run it.)

Granted that that's part of the plan and not serendipity, nonetheless
I think the default case should be that text operations produce the
expected result in the text domain, even at the expense of array
invariants. People who need arrays of code points have several ways
to get them, and the usual comparison operators will work on them as
desired, while people who need operations on *text* still have no
straightforward way to get them, and no promise of one as I read your
remarks.
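Here are the sketches I referred to above. The first makes the
5-versus-6 point concrete; it's written against the Python 2 unicode
type, and the particular literals and the use of unicodedata are just
my illustration, not anything the language does for you:

    import unicodedata

    composed   = u"L\u00F6wis"    # precomposed o-with-diaeresis: 5 code points
    decomposed = u"Lo\u0308wis"   # 'o' + COMBINING DIAERESIS:    6 code points

    print len(composed), len(decomposed)                        # 5 6
    print composed == decomposed                                # False, though they render alike
    print unicodedata.normalize("NFC", decomposed) == composed  # True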
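The second sketch is Martin's len(a) + len(b) == len(a+b) point in
executable form. Calling unicodedata.normalize by hand here stands in
for a hypothetical string type that normalized its contents for you:

    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    a = u"e"        # LATIN SMALL LETTER E
    b = u"\u0301"   # COMBINING ACUTE ACCENT

    # Under NFC the accent fuses with the preceding 'e' into U+00E9,
    # so length is not additive across concatenation:
    print len(nfc(a)) + len(nfc(b))   # 2
    print len(nfc(a + b))             # 1

    # NFD never composes anything, so the lengths do add up:
    print len(nfd(a)) + len(nfd(b))   # 2
    print len(nfd(a + b))             # 2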
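The third sketch is roughly the file-name test I described. It's only
a sketch: what it prints depends on the OS and filesystem (HFS+ stores
names in a decomposed form), not on Python, and the file name is just
an example:

    import os, unicodedata

    name = u"test-\u010D.txt"    # U+010D, precomposed c-with-caron
    open(name, "w").close()

    # The name handed back by the OS may be the decomposed spelling, so
    # comparing raw names can fail where comparing normalized names works.
    for n in os.listdir(u"."):
        if n.startswith(u"test-"):
            print n == name, unicodedata.normalize("NFC", n) == name

    os.remove(name)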