Eryk Sun added the comment:
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which
will match the broken behavior of the UTF-8 and UTF-32 codecs:
>>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
u'\ud83d\uda12'
>>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
u'\ud83d\uda12'
Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally
for tricks like the "surrogateescape" error handler. In Python 3.4+. the
'surrogatepass' error handler allows encoding and decoding lone surrogates:
>>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
b'=\xd8\x12\xda'
>>> _.decode('utf-16le', 'surrogatepass')
'\ud83d\uda12'
----------
nosy: +eryksun
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue27971>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com