Rauli Ruohonen writes:

 > In my mind everything in a Python program is within a single
 > Unicode process,
Which is a *serious* mistake. It is *precisely* the mistake that
leads to mixing UTF-16 and UCS-2 interpretations in the standard
library. What you are saying is that if you write a 10-line script
that claims Unicode conformance, you are responsible for the
Unicode-correctness of all modules you call implicitly, as well as
that of the Python interpreter. This is what I mean by "Unicode
conformance is not a goal of the language."

Now, it's really not so bad. If you look at what MAL and MvL are
doing (inter alia; it's their work I'm most familiar with), what you
will see is that they are gradually implementing conformant modules
here and there. E.g., I am sure it is not MvL's laziness or
inability to come up with a reasonable spec himself that causes PEP
3131 to be a profile of UAX #31.

 > Actually, I said that there's no way to always do the right thing
 > as long as they are mixed, but that was too theoretical an
 > argument. Practically speaking, there's little need to interpret
 > surrogate pairs as two code points instead of as one non-BMP code
 > point.

Again, a mistake. In the standard library, the question is not "do
I need this?", but "what happens if somebody else does it?" The two
questions may receive the same answer, but then again they may not.

For example, suppose you have a supplier-consumer pair sharing a
fixed-length buffer of 2-octet code units. If the supplier happens
to use the UCS-2 interpretation, then a surrogate pair may get
split when the buffer is full. Will a UTF-16 consumer be prepared
for this? Almost surely some will not, because that would imply
maintaining an internal buffer, which is stupidly inefficient if
you have an external buffer protocol. Note that a UTF-16 supplier
feeding a UCS-2 consumer will have no problems (unless the UCS-2
consumer can't handle "short reads", but that's unlikely), and if
you have a chain starting with a UTF-16 source, then none of the
downstream UTF-16 processes have a problem. The problem is: suppose
you somehow get a UCS-2 source? Whose responsibility is it to
detect that? (A concrete sketch of this split-pair hazard follows
at the end of this message.)

 > Java and C# (and thus Jython and IronPython too) also sometimes
 > use UCS-2, sometimes UTF-16. As long as it works as you expect,
 > there isn't a problem, really.

That depends on how big a penalty you face if you break a promise
of conformance to your client. Death, taxes, and Murphy's Law are
inescapable.

 > On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32
 > with the extension that surrogates work as in UTF-16), but you
 > get the extra complication that some equal strings don't compare
 > equal, e.g. u'\U00010000' != u'\ud800\udc00'. Even that doesn't
 > cause problems in practice, because you shouldn't have strings
 > like u'\ud800\udc00' in the first place.

But the Unicode standard itself gives (the equivalent of)
u'\ud800' + u'\udc00' as an example of the kind of thing you
*should be able to do*. Because, you know, clients of the standard
library *will* be doing half-witted[1] things like that (both
points are sketched below as well).

Footnotes:
[1] What I wanted to say was いい加減にしろよ! (roughly, "cut it
    out, already!") <wink>
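To make the split-pair hazard concrete, here is a minimal sketch.
It assumes Python 3 (3.3+, for the u'...' spellings) and its strict
UTF-16 codec; Python 2's codec is laxer about lone surrogates. The
buffer size and names are purely illustrative, not any stdlib
buffer protocol:

    # A UCS-2 supplier filling a fixed buffer of 2-octet code units
    # can split a surrogate pair across two reads; a strict UTF-16
    # consumer then sees a lone high surrogate.
    import codecs

    data = (u'aaa' + u'\U00010000').encode('utf-16-be')  # 10 octets
    BUFSIZE = 8  # four code units: the pair straddles the boundary
    chunks = [data[i:i + BUFSIZE]
              for i in range(0, len(data), BUFSIZE)]

    # A consumer that decodes each chunk independently chokes on
    # the lone high surrogate at the end of the first chunk:
    try:
        for chunk in chunks:
            chunk.decode('utf-16-be')
    except UnicodeDecodeError as e:
        print('split surrogate pair:', e)

    # A conformant UTF-16 consumer must carry state across reads --
    # the internal buffer I claim many consumers won't bother with:
    dec = codecs.getincrementaldecoder('utf-16-be')()
    text = u''.join(dec.decode(c) for c in chunks)
    text += dec.decode(b'', final=True)
    assert text == u'aaa\U00010000'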
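The equality discrepancy quoted above falls straight out of the
narrow/wide build split. A minimal sketch, assuming a Python 2.x
interpreter (the only place both build types exist; Python 3.3+
always behaves like the wide case):

    # Narrow vs. wide build behavior for the first non-BMP code point.
    import sys

    s = u'\U00010000'

    if sys.maxunicode == 0xFFFF:
        # Narrow build: two 16-bit code units, so the UCS-2
        # interpretation sees two items where UTF-16 sees one
        # character, and the surrogate-pair spelling compares equal.
        assert len(s) == 2
        assert s == u'\ud800\udc00'
    else:
        # Wide build (or Python 3): one code point, and the explicit
        # surrogate pair is a *different*, unequal string.
        assert len(s) == 1
        assert s != u'\ud800\udc00'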
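And the surrogate-halves construction itself, under the same
assumptions (the 'surrogatepass' error handler mentioned in the
comment is Python 3.1+ only):

    # Building a character from its halves, as in the standard's
    # example (the equivalent of u'\ud800' + u'\udc00').
    import sys

    pair = u'\ud800' + u'\udc00'
    if sys.maxunicode == 0xFFFF:
        # Narrow build: code-unit storage fuses the halves.
        assert pair == u'\U00010000'
    else:
        # Wide build or Python 3: two code points, not one
        # character, so the construction silently fails to fuse.
        assert pair != u'\U00010000'
        # On Python 3 the two spellings still denote the same
        # encoded character:
        #   pair.encode('utf-16-be', 'surrogatepass')
        #       .decode('utf-16-be') == u'\U00010000'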
