"Martin v. Löwis" writes: > I disagree: Quoting from Unicode 5.0, section 5.4: > > # The individual components of implementations may have different > # levels of support for surrogates, as long as those components are > # assembled and communicate correctly.
"Assembly" is the problem. If chr() or a slice creates a lone surrogate and surrogateescape passes it back out, Python as a whole is non-conforming. Technically, you can hide behind "none of slicing, chr(), or surrogateescape promises to conform", and maybe that would fly to a standards lawyer; I'd have to see the precise statement. Here's a more convincing example. A user specifies "utf8" as her locale charset. Then she specifies a string containing a non-BMP character as the "description" of a file, and internal code munges this via slicing into a file name conforming to some specification (eg, length limit + uniquifier if needed). Then if the non-BMP character is in the "right" place, she will get either a broken file name, which will either get written to disk or raise an exception, depending on whether the munging program has enabled surrogateescape or not. I claim both of those results are non-conforming to the specification of UTF-16, and therefore Python Unicode processing as a whole must be considered non-conforming. It's still pretty damn good. But I've elaborated that point elsewhere. > The rationale for supporting these characters in chr() goes back much > further than the surrogateescape handler - as Python unicode strings > are sequences of code points, it would be impractical if you couldn't > create some of them, or even would have to consult the UCD before > determining whether they can be created. The Zen is irrelevant to determining conformance to Unicode, which has its own Zen. _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com