[issue27971] utf-16 decoding can't handle lone surrogates

Eryk Sun Tue, 06 Sep 2016 07:11:22 -0700

Eryk Sun added the comment:

Probably Python 2's UTF-16 decoder should be as broken as the encoder, which 
will match the broken behavior of the UTF-8 and UTF-32 codecs:


    >>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
    u'\ud83d\uda12'
    >>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
    u'\ud83d\uda12'

Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally 
for tricks like the "surrogateescape" error handler. In Python 3.4+, the 
'surrogatepass' error handler allows encoding and decoding lone surrogates:

    >>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
    b'=\xd8\x12\xda'
    >>> _.decode('utf-16le', 'surrogatepass')
    '\ud83d\uda12'
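
To make the contrast with the default 'strict' handler explicit, here is a rough sketch for Python 3.4+. The `roundtrip` helper and the `LONE` name are made up for illustration and are not part of the codecs API:

```python
# Python 3.4+ sketch. `roundtrip` and `LONE` are hypothetical names
# used only for this illustration.
LONE = '\ud83d\uda12'  # two high surrogates, so neither forms a valid pair


def roundtrip(text, encoding, errors='strict'):
    """Encode then decode `text`; return the result, or the exception raised."""
    try:
        return text.encode(encoding, errors).decode(encoding, errors)
    except UnicodeError as exc:
        return exc


for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
    strict = roundtrip(LONE, enc)                   # expected to fail while encoding
    passed = roundtrip(LONE, enc, 'surrogatepass')  # expected to round-trip
    print(enc, type(strict).__name__, passed == LONE)
```

With 'strict' the encode step should raise UnicodeEncodeError for all three codecs, while 'surrogatepass' should return the original string unchanged.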

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27971>
_______________________________________
