"Martin v. Löwis" writes: > Chapter and verse?
Unicode 5.0, Chapter 3, verse C9: When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code sequences. I think anything called "UTF-8 something" is likely to be taken to "purport". Furthermore, users don't necessarily see which error handlers are being used. A user who specifies "utf8" as the output codec is likely to be rather surprised if non-UTF-8 is emitted because the app specified surrogateescape. Eg, consider a script which munges file descriptions into reasonable-length file names on Unix. Yes, technically the non-Unicode output is the app's fault, but I expect many users will put some blame on Python. I am in full agreement with you about the technicalities, but I am looking for ways to clue in users that (a) the technicalities matter, and (b) that Python does a *very* good job of making things as safe as possible without becoming unable to handle bytes. I think "wide" vs. "narrow" fails at both. It focuses on storage issues, which of course are important, but at the cost of ignoring the fact that for users of non-BMP characters 32-bit code units are much safer. Users who need non-BMP characters are relatively few, and at least at the present time most are painfully aware of the need to care for technicalities. I expect them to be pleasantly surprised by how easy it is to get reasonably safe behavior even from a 16-bit build. > > Python's internal coding does not conform to UTF-16, and that internal > > coding can, under certain conditions, escape to the outside world as > > invalid "Unicode" output. > > I'm fairly certain there are provisions in the Unicode standard for such > behavior (taking into account "certain conditions"). Sure. There's nothing in the Unicode standard that says you have to conform to it unless you claim to conform to it. So it is valid to say that Python's Unicode codecs without surrogateescape do conform. The point is that Python does not, even if all of the input is valid Unicode, because of the provision of surrogateescape and the lack of Unicode conformance-checking for certain internal functionality like chr() and slicing. You can say "we don't make any such claim", but IMO the distinction in question is too fine a point for most users, and requires a very large amount of Unicode knowledge (not to mention standards geekiness) to even understand the precise statement. "Unicode support" to users should mean that Python does the right thing, not that if you look hard enough in the documentation you will discover that Python doesn't claim to do the right thing even though in practice it mostly does. IMO, "UCS-2" is a pretty good description of what the user can leave up to Python in perfect safety. RDM's reply worries me a little, but I'll reply to his message separately. > *Any* Unicode implementation will do that, since they all have to > support legacy encodings in some form. This is certainly conforming to > the Unicode standard, and in fact one of the primary Unicode design > principles. No. Support for legacy encodings takes you outside of the realm of Unicode conformance by definition. Their names tell you that, however. "UTF-8 with surrogate escapes" on the other hand is an entirely different kettle of fish. It pretends to be UTF-8, but isn't. I think that users who give Python valid input should be able to expect valid output, but they can't. Chapter 3, verse C7: When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences, or the deletion of *noncharacter* code points. Sure, you can tell users the truth: "Python may modify your Unicode characters if you slice or index Unicode strings. It may even silently turn them into invalid codes which will eventually raise Errors." Then you are conformant, but why would anyone want to use such a program? If you tell them "UCS-2[sic] Python is safe to use with *no* extra care if you use only UCS-2 [or BMP] characters", suddenly Python looks very nice indeed again. "UCS-4" Python is even better; all you have to do is to avoid surrogateescape codecs. However, you're still vulnerable to hard-to-diagnose errors at the output stage in case of program bugs, because not enough checking of values is done by Python itself. > > A Unicode-conforming Python implementation would error at the > > chr() call, or perhaps would not provide surrogateescape error > > handlers. > > Chapter and verse? Chapter 3, verse C9 again. > > "Although practicality beats purity." > > The Unicode standard itself is based on practicality. It wouldn't > have received the success it did if it was based on purity only > (and indeed, was often rejected in cases where it put purity over > practicality, e.g. with the Hangul syllables). Python practicality is very different from Unicode practicality. _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com