Ezio Melotti <ezio.melo...@gmail.com> added the comment:

It turned out that this can't be fixed in 2.7 unless we backport the patch in 
#5127 (it's in 3.2/3.3 but not in 2.7).

IIUC the macro works fine and joins surrogate pairs into a Py_UCS4 char, but 
since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, 
the higher bits get truncated and the macros return wrong results.

So, for example
    >>> u'\ud800\udc42'.isupper()
    True
because \ud800 + \udc42 = \U00010042  →  \U00010042 gets truncated to \u0042  
→  \u0042 is the LATIN CAPITAL LETTER B  →  .isupper() returns True.
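
The arithmetic for the pair in this example can be checked with a small 
sketch. It is written for Python 3 and only models the join and the 16-bit 
truncation in plain integer arithmetic; the helper names join_surrogates and 
truncate_to_ucs2 are made up for illustration and are not the C macros from 
the patch.

```python
def join_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def truncate_to_ucs2(code_point):
    """Model what happens when a 16-bit type receives the code point."""
    return code_point & 0xFFFF


# Hypothetical walk-through of u'\ud800\udc42' (names are illustrative only).
full = join_surrogates(0xD800, 0xDC42)
narrow = truncate_to_ucs2(full)

print(hex(full))               # the joined code point
print(hex(narrow))             # what a Py_UCS2-sized argument would see
print(chr(full).isupper())     # answer for the real character
print(chr(narrow).isupper())   # answer after truncation
```

The last two lines differ, which is the wrong result described above: the 
check is answered for the truncated BMP character, not for the character the 
pair actually encodes.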

The current behavior is broken in a different way, because it checks 
u'\ud800'.isupper() and u'\udc42'.isupper() separately.
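
A rough way to see why the per-code-unit check fails, again as a Python 3 
sketch and not the 2.7 implementation: split a non-BMP character into its 
UTF-16 code units (which is what a narrow build stores) and test each unit on 
its own. The helper utf16_code_units is hypothetical, and U+10400 (DESERET 
CAPITAL LETTER LONG I) is picked here only because it is an uppercase letter 
outside the BMP.

```python
import struct
import unicodedata


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


ch = chr(0x10400)  # an uppercase letter outside the BMP
units = utf16_code_units(ch)

# Each unit is a lone surrogate (category Cs), so it is never "upper".
per_unit = [chr(u).isupper() for u in units]
categories = [unicodedata.category(chr(u)) for u in units]

print([hex(u) for u in units])
print(categories)
print(per_unit)        # per-code-unit answers, as on a narrow build
print(ch.isupper())    # answer for the whole character
```

Checking the units one by one can never report the character as uppercase, 
even though the character itself is.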

Would it make sense to backport #5127 or should I just give up and leave it 
broken?

----------
title: Make str methods work with non-BMP chars on narrow builds -> Make the 
str.is* methods work with non-BMP chars on narrow builds

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9200>
_______________________________________