On 6/10/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I think you misunderstand. Anything in Unicode that is normative is
> about interchange. Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).

Like Martin said, "what is a process?" :-) If you have a module that uses noncharacters to mean something and it documents that, then that may well be useful to its users. In my mind everything in a Python program is within a single Unicode process, unless you have a very high level component which specifies otherwise in its API documentation.

> Your complaint about Python mixing "pseudo-UTF-16" with "pseudo-UCS-2"
> is precisely a statement that various modules in Python do not specify
> what encoding forms they purport to accept or emit.

Actually, I said that there's no way to always do the right thing as long as they are mixed, but that was too theoretical an argument. Practically speaking, there's little need to interpret surrogate pairs as two code points instead of as one non-BMP code point. The best use case I could come up with was reading in an ill-formed UTF-8 file to see what makes it ill-formed, but that's best done using bytes anyway.

E.g. '\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80' decodes to u'\ud800\udc00\U00010000' on both builds, but since u'\U00010000' == u'\ud800\udc00' on a UCS-2 build, the distinction is lost there. Effectively the codec has decoded the first two code points to UCS-2 and the last code point to UTF-16, forming a string which mixes the two interpretations instead of using one of them consistently, and because of that you can no longer recover the original code point stream. But what the decoder should really do is raise an exception anyway, as the input is ill-formed.

Java and C# (and thus Jython and IronPython too) also sometimes use UCS-2, sometimes UTF-16. As long as it works as you expect, there isn't a problem, really.

On UCS-4 builds of CPython it's the same (either UCS-4 or UTF-32 with the extension that surrogates work as in UTF-16), but you get the extra complication that some equal strings don't compare equal, e.g. u'\U00010000' != u'\ud800\udc00'.
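
For anyone who wants to try the example on a current interpreter, here is a minimal sketch. It assumes CPython 3.4 or later, which behaves roughly like the UCS-4 builds described above rather than like the UCS-2 ones. The helper names (combine_surrogates, strict_decode_fails) are made up for illustration, and the 'surrogatepass' error handler is used only to imitate the lenient decoding discussed here, not because it is what the old codec did internally.

```python
# Sketch only: assumes CPython 3.4+, where str holds code points
# (comparable to the UCS-4 builds discussed above).

# The ill-formed UTF-8 input from the example: two UTF-8-encoded
# surrogates (U+D800, U+DC00) followed by a well-formed U+10000.
ILL_FORMED = b'\xed\xa0\x80\xed\xb0\x80\xf0\x90\x80\x80'


def combine_surrogates(high, low):
    """Hypothetical helper: map a UTF-16 surrogate pair to its code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def strict_decode_fails(data):
    """Hypothetical helper: True if the strict UTF-8 decoder rejects data."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return True
    return False


# A strict decoder raises, as argued above.
rejected = strict_decode_fails(ILL_FORMED)

# 'surrogatepass' lets the encoded surrogates through as code points.
lenient = ILL_FORMED.decode('utf-8', 'surrogatepass')

# Going through UTF-16 code units merges the surrogates into one
# code point, so the original code point stream is not recoverable.
units = lenient.encode('utf-16-be', 'surrogatepass')
round_tripped = units.decode('utf-16-be')

if __name__ == '__main__':
    print(rejected)
    print([hex(ord(c)) for c in lenient])
    print([hex(ord(c)) for c in round_tripped])
    print(hex(combine_surrogates(0xD800, 0xDC00)))
    # Code-point-based strings keep the two spellings distinct.
    print('\U00010000' == '\ud800\udc00')
```

On such an interpreter the strict decode is rejected, the lenient decode yields three code points, and passing those through UTF-16 code units gives back only two. The last line shows the UCS-4 style inequality: the two spellings of U+10000 do not compare equal.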
Even that doesn't cause problems in practice, because you shouldn't have strings like u'\ud800\udc00' in the first place.
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
