[issue14200] Idle shell crash on printing non-BMP unicode character

Vlastimil Brom Mon, 05 Mar 2012 16:39:18 -0800

Vlastimil Brom <[email protected]> added the comment:

I'd like to add some further observations to this issue;
it seems that the crash is indeed not specific to IDLE.
In a sample tkinter app, where I just display e.g. chr(66352) in an Entry 
widget, I get the same immediate crash with pythonw.exe, and the previously 
mentioned "proper" ValueError without a crash with python.exe.


I also tried to explicitly display a surrogate pair; such pairs were used 
automatically up to python 3.2. They can be used in tkinter in 3.3, but there 
are limitations and discrepancies:

>>> 
>>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
>>> def wide_char_to_surrog_pair(char):
    code_point = ord(char)
    if code_point <= 0xFFFF:
        return char
    else:
        high_surr = (code_point - 0x10000) // 0x400 + 0xD800
        low_surr = (code_point - 0x10000) % 0x400 + 0xDC00
        return chr(high_surr)+chr(low_surr)

>>> ahsa_surrog = wide_char_to_surrog_pair(got_ahsa)
>>> print(ahsa_surrog)
𐌰
>>> repr(ahsa_surrog)
"'_ud800\x00udf30'"
>>> ahsa_surrog
'Pud800 udf30'

[the space in the middle of the last item might be \x00, as it terminates the 
clipboard content; the rest was copied separately]

The printed square corresponds to the given character and can be used in 
other programs etc. In py 3.2, the same value was used for repr and for a 
direct "display" of the string in the interpreter, whereas in py 3.3 there are 
three different formats.

I also noticed that a surrogate pair is no longer supported as input for 
unicodedata.name(...):
 
>>> import unicodedata
>>> unicodedata.name(ahsa_surrog)
Traceback (most recent call last):
  File "<pyshell#60>", line 1, in <module>
    unicodedata.name(ahsa_surrog)
TypeError: need a single Unicode character as parameter
>>> 

(in 3.2 and probably others it returns the expected 'GOTHIC LETTER AHSA')
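
A possible workaround on 3.3, as a minimal sketch: join the surrogate pair 
back into a single code point before the lookup. The helper name 
join_surrogates is my own and not part of unicodedata; it relies on the 
"surrogatepass" error handler of the UTF-16 codec, and it will raise 
UnicodeDecodeError if the string contains a lone (unpaired) surrogate.

```python
import unicodedata


def join_surrogates(text):
    """Merge UTF-16 surrogate pairs in text into single code points.

    Hypothetical helper for illustration only. The string is encoded to
    UTF-16 while letting surrogates through, then decoded normally, which
    combines each high/low pair into one character.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# The pair for GOTHIC LETTER AHSA (U+10330), as built in the session above.
ahsa_surrog = "\ud800\udf30"
ahsa_joined = join_surrogates(ahsa_surrog)
ahsa_name = unicodedata.name(ahsa_joined)
```

With this, ahsa_joined is a single character again and can be passed to 
unicodedata.name(...) without the TypeError.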

For my part, I would think that keeping the input possibilities for 
unicodedata a bit liberal (but still unambiguous) wouldn't hurt. Also, if 
tkinter is not going to support wide unicode natively any time soon, output 
conversion using surrogates, which other programs also understand, 
seems the most usable option in this regard.
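
To illustrate what such an output conversion could look like, here is a 
minimal sketch that applies the same arithmetic as 
wide_char_to_surrog_pair above to a whole string. The function name 
non_bmp_to_surrogates is hypothetical, and whether tkinter itself should do 
something like this internally is exactly the open question; this only shows 
the string transformation, not any actual tkinter code.

```python
def non_bmp_to_surrogates(text):
    """Replace every non-BMP character in text by its UTF-16 surrogate pair.

    Hypothetical helper for illustration; BMP characters pass through
    unchanged.
    """
    pieces = []
    for char in text:
        code_point = ord(char)
        if code_point <= 0xFFFF:
            pieces.append(char)
        else:
            offset = code_point - 0x10000
            pieces.append(chr(0xD800 + offset // 0x400))  # high surrogate
            pieces.append(chr(0xDC00 + offset % 0x400))   # low surrogate
    return "".join(pieces)


converted = non_bmp_to_surrogates("a\N{GOTHIC LETTER AHSA}b")
```

The result could then be handed to a widget in place of the original string, 
assuming the widget accepts surrogate pairs as described above.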

Hopefully this is relevant to the original issue;
I am not sure whether some parts would be better posted as separate 
issues, or whether this is the planned and expected behaviour anyway.

regards,
   vbr

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue14200>
_______________________________________
