[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Adam Olsen Mon, 01 Sep 2008 23:52:10 -0700

Adam Olsen <[EMAIL PROTECTED]> added the comment:

Marc, I don't understand what you're saying.  UTF-16's surrogates are
not optional.  Unicode 2.0 and later require them, and Python is
supposed to support it.
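
As a rough illustration of what the surrogate mechanism looks like in practice, here is a small sketch in modern Python 3 (which postdates this message). The choice of U+1D11E as the sample character and the variable names are mine, not anything from the issue:

```python
# Encode one code point outside the BMP as UTF-16 and inspect the code units.
ch = "\U0001D11E"                      # sample supplementary-plane character
units = ch.encode("utf-16-be")         # big-endian, no BOM: four bytes

high = int.from_bytes(units[0:2], "big")   # leading (high) surrogate
low = int.from_bytes(units[2:4], "big")    # trailing (low) surrogate

# Recombine the pair into the original code point.
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(high), hex(low), hex(cp))
```

The two code units fall in the D800-DBFF and DC00-DFFF ranges, which is the only way UTF-16 can represent a code point above U+FFFF.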


Likewise, UCS-4 originally allowed a much larger range of code points,
but it no longer does; allowing them would mean supporting only old,
archaic versions of the standards, which is clearly not desirable.
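
A minimal sketch of that range limit, assuming a modern Python 3 interpreter rather than the builds discussed in this issue. The helper name in_unicode_range is hypothetical:

```python
def in_unicode_range(n):
    """Return True if chr() accepts n, i.e. n is within U+0000..U+10FFFF."""
    try:
        chr(n)
    except ValueError:
        return False
    return True

def decodes_as_utf32(n):
    """Return True if the 4-byte big-endian value n decodes as UTF-32."""
    try:
        n.to_bytes(4, "big").decode("utf-32-be")
    except UnicodeDecodeError:
        return False
    return True

print(in_unicode_range(0x10FFFF), in_unicode_range(0x110000))
print(decodes_as_utf32(0x10FFFF), decodes_as_utf32(0x110000))
```

Values above U+10FFFF fit in 32 bits, but the interpreter and the UTF-32 codec both reject them.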

You are right that I shouldn't have said "a pair of ill-formed code
units".  I should have said "a pair of unassigned code points", which is
how UCS-2 always has classified them and always will.

Although Python may allow ill-formed sequences to be created internally
(primarily lone surrogates on UTF-16 builds), it cannot encode or decode
them.  The standard is clear that these are to be treated as errors,
which the "errors" argument of .decode() controls.  You could add a new
value for "errors" to pass the garbage through, but I fail to see a use
case for it.
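
To make the behaviour concrete, here is a sketch assuming a modern Python 3 interpreter, which is not the build this issue was filed against. The helper try_encode is hypothetical. The "surrogatepass" handler used at the end was, as far as I know, only added to Python 3 after this message was written, so treat it as a later development and not as something available here:

```python
lone = "\ud800"   # a lone high surrogate; the str object itself can be created

def try_encode(text, errors="strict"):
    """Encode to UTF-8, returning None if the codec reports an error."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

strict_result = try_encode(lone)              # the default handler refuses it
replaced = try_encode(lone, "replace")        # substitutes a placeholder byte

# Decoding the byte pattern a lone surrogate would have is likewise an error.
raw = b"\xed\xa0\x80"
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Later addition to Python 3: an explicit opt-in handler that passes it through.
passed = lone.encode("utf-8", "surrogatepass")

print(strict_result, replaced, decode_failed, passed)
```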

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________