New submission from Serhiy Storchaka: Trying to implement UTF-7 codec in Python I found some warts in error handling.
1. Non-ASCII bytes. No errors: >>> 'a€b'.encode('utf-7') b'a+IKw-b' >>> b'a+IKw-b'.decode('utf-7') 'a€b' Terminating '-' at the end of the string is optional. >>> b'a+IKw'.decode('utf-7') 'a€' And sometimes it is optional in the middle of the string (if following char is not used in BASE64). >>> b'a+IKw;b'.decode('utf-7') 'a€;b' But if following char is not ASCII, it is accepted as well, and this looks as a bug. >>> b'a+IKw\xffb'.decode('utf-7') 'a€ÿb' In all other cases non-ASCII byte causes an error: >>> b'a\xffb'.decode('utf-7') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode return codecs.utf_7_decode(input, errors, True) UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character >>> b'a\xffb'.decode('utf-7', 'replace') 'a�b' 2. Ending lone high surrogate. Lone surrogates are silently accepted by utf-7 codec. >>> '\ud8e4\U0001d121'.encode('utf-7') b'+2OTYNN0h-' >>> '\U0001d121\ud8e4'.encode('utf-7') b'+2DTdIdjk-' >>> b'+2OTYNN0h-'.decode('utf-7') '\ud8e4𝄡' >>> b'+2OTYNN0h'.decode('utf-7') '\ud8e4𝄡' >>> b'+2DTdIdjk-'.decode('utf-7') '𝄡\ud8e4' Except at the end of unterminated shift sequence: >>> b'+2DTdIdjk'.decode('utf-7') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode return codecs.utf_7_decode(input, errors, True) UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence 3. Incorrect shift sequence. Strange behavior happens when shift sequence ends with wrong bits. >>> b'a+IKx-b'.decode('utf-7', 'ignore') 'a€b' >>> b'a+IKx-b'.decode('utf-7', 'replace') 'a€�b' >>> b'a+IKx-b'.decode('utf-7', 'backslashreplace') 'a€\\x2b\\x49\\x4b\\x78\\x2db' The decoder first decodes as much characters as can, and then pass all shift sequence (including already decoded bytes) to error handler. Not sure this is a bug, but this differs from common behavior of other decoders. ---------- components: Unicode messages: 248450 nosy: ezio.melotti, haypo, lemburg, loewis, serhiy.storchaka priority: normal severity: normal status: open title: Warts in UTF-7 error handling type: behavior versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6 _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue24848> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com