New submission from Ma Lin:

This issue is split from issue24117. That issue became a soup of small issues, so I'm going to close it.

For a 4-byte GB18030 sequence, the legal ranges are:

    0x81-0xFE for the 1st byte
    0x30-0x39 for the 2nd byte
    0x81-0xFE for the 3rd byte
    0x30-0x39 for the 4th byte

GB18030 standard:
https://en.wikipedia.org/wiki/GB_18030
https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630

The current code does not check the 0xFE upper bound for the 1st and 3rd bytes. As a result, 8630 illegal 4-byte sequences can be decoded by the GB18030 codec. Here is an example:

    # The legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, which is fine.
    uchar = b'\x81\x31\x81\x30'.decode('gb18030')
    print(hex(ord(uchar)))

    # The illegal sequence 0x8130FF30 is decoded to U+060A as well; this should not happen.
    uchar = b'\x81\x30\xFF\x30'.decode('gb18030')
    print(hex(ord(uchar)))

----------
components: Unicode
messages: 291153
nosy: Ma Lin, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Range checking in GB18030 decoder
type: behavior
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29990>
_______________________________________
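
The byte ranges listed in the report can be sketched as a standalone check, independent of the codec. This is only an illustration, not the codec's actual code: the function names `is_legal_four_byte` and `linear_index` are hypothetical, and the index formula is the commonly documented way of flattening a 4-byte GB18030 sequence into one number (an assumption not stated in the report). It suggests why the two sequences in the example collide: without the upper-bound check, a 3rd byte of 0xFF simply carries over into the 2nd byte's position.

```python
def is_legal_four_byte(seq: bytes) -> bool:
    """Check a candidate against the four byte ranges given in the report."""
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    return (
        0x81 <= b1 <= 0xFE      # 1st byte
        and 0x30 <= b2 <= 0x39  # 2nd byte
        and 0x81 <= b3 <= 0xFE  # 3rd byte
        and 0x30 <= b4 <= 0x39  # 4th byte
    )


def linear_index(seq: bytes) -> int:
    """Flatten a 4-byte sequence into a single index (no range checking)."""
    b1, b2, b3, b4 = seq
    return (
        (b1 - 0x81) * 12600
        + (b2 - 0x30) * 1260
        + (b3 - 0x81) * 10
        + (b4 - 0x30)
    )


legal = b'\x81\x31\x81\x30'
illegal = b'\x81\x30\xFF\x30'

print(is_legal_four_byte(legal))    # 3rd byte 0x81 is inside 0x81-0xFE
print(is_legal_four_byte(illegal))  # 3rd byte 0xFF is outside 0x81-0xFE
print(linear_index(legal) == linear_index(illegal))
```

A decoder that validates all four ranges before computing the index would reject the second sequence instead of mapping it to the same code point as the first.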