> The only drawback I can see is if the UTF-8 bytes actually decode to a
> half surrogate. However, half surrogates should really only occur in
> UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
> anyway!

Right: that's the rationale for UTF-8b. Encoding half surrogates
violates parts of the Unicode spec, so UTF-8b is "safe".

> As for handling this case, you could either:
>
> 1. Raise an exception (which is what you're trying to avoid)
>
> or:
>
> 2. Treat it as invalid UTF-8 and map the bytes to half surrogates
>    (encoding would produce the original bytes).
>
> I'd prefer option 2.

I hadn't thought of this case, but you are right - they *are* illegal
bytes, after all. Raising an exception would be useless since the whole
point of this codec is to never raise unicode errors.

Regards,
Martin

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
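
For illustration, here is a minimal sketch of option 2. It assumes a Python version that ships the "surrogateescape" error handler (3.1 or later) and uses that handler as a stand-in for the UTF-8b codec discussed in this thread; the variable names are made up for the example, and the proposed codec itself may differ in detail.

```python
# Sketch only: "surrogateescape" is used here as a stand-in for UTF-8b.

raw = b"abc\xff\xfe"  # 0xFF and 0xFE never occur in valid UTF-8

# Each undecodable byte 0xNN is mapped to the half surrogate U+DC00 + 0xNN.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "abc\udcff\udcfe"

# Encoding with the same handler produces the original bytes again.
assert text.encode("utf-8", errors="surrogateescape") == raw

# The case raised above: bytes that look like a UTF-8 encoding of a half
# surrogate (ED B2 80 would be U+DC80). A strict decoder rejects them, so
# they are treated as invalid bytes and escaped one by one.
surrogate_bytes = b"\xed\xb2\x80"
escaped = surrogate_bytes.decode("utf-8", errors="surrogateescape")
assert escaped == "\udced\udcb2\udc80"

# The round trip still holds, and no exception is raised along the way.
assert escaped.encode("utf-8", errors="surrogateescape") == surrogate_bytes
```

Because well-formed UTF-8 never produces half surrogates on decoding, the escaped characters cannot collide with legitimately decoded text, which is what makes the round trip unambiguous.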