[issue10542] Py_UNICODE_NEXT and other macros for surrogates

Alexander Belopolsky Wed, 29 Dec 2010 10:31:37 -0800

Alexander Belopolsky <[email protected]> added the comment:


On Sat, Nov 27, 2010 at 5:24 PM, Marc-Andre Lemburg
<[email protected]> wrote:
..
> Perhaps we should allow ord() to work on surrogates in
> UCS4 builds as well. That would reduce the number of
> surprises.
>

This is an interesting idea, however, having surrogates in UCS4 builds
will sooner or later lead to surprises such as

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

I though UCS4 (or more properly, UTF-32) did not allow encoding of
surrogate code points.

It is somewhat bothersome that a valid string literal such as
'\uD800\uDC00' in narrow build is subtly invalid in wide build.  It
would probably be better if  '\uD800\uDC00'  was either rejected on a
wide build, or interpreted as a single character so that

True

on any build.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue10542>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

Reply via email to