New submission from Ma Lin:

This issue is split from issue24117. That issue became a soup of small issues, 
so I'm going to close it.

For a 4-byte GB18030 sequence, the legal range is:
0x81-0xFE for the 1st byte
0x30-0x39 for the 2nd byte
0x81-0xFE for the 3rd byte
0x30-0x39 for the 4th byte
GB18030 standard:
https://en.wikipedia.org/wiki/GB_18030
https://pan.baidu.com/share/link?shareid=2606985291&uk=3341026630
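
The byte ranges above can be written as a small check. This is only a sketch 
for illustration: is_legal_four_byte is a hypothetical helper, not part of the 
codec, and it tests the byte ranges only, not whether the sequence is actually 
assigned to a code point.

```python
def is_legal_four_byte(seq: bytes) -> bool:
    """Return True if seq fits the 4-byte GB18030 byte ranges listed above."""
    if len(seq) != 4:
        return False
    b1, b2, b3, b4 = seq
    # 1st and 3rd bytes: 0x81-0xFE; 2nd and 4th bytes: 0x30-0x39
    return (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39)
```

With this check, b'\x81\x31\x81\x30' passes and b'\x81\x30\xFF\x30' is 
rejected because its 3rd byte is above 0xFE.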

The current code forgets to check the 0xFE upper bound for the 1st and 3rd bytes.
As a result, 8630 illegal 4-byte sequences can be decoded by the GB18030 
codec. Here is an example:

# The legal sequence b'\x81\x31\x81\x30' is decoded to U+060A, which is fine.
uchar = b'\x81\x31\x81\x30'.decode('gb18030')
print(hex(ord(uchar)))

# The illegal sequence b'\x81\x30\xFF\x30' is decoded to U+060A as well; this
# should not happen.
uchar = b'\x81\x30\xFF\x30'.decode('gb18030')
print(hex(ord(uchar)))
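
One way to see why the two sequences collide, assuming the decoder derives the 
code point from the usual linear index over the four bytes (an assumption 
about the implementation, not something stated in this report): a 3rd byte of 
0xFF overflows into the position of the 2nd byte. linear_index below is a 
hypothetical helper written only to show the arithmetic.

```python
def linear_index(seq: bytes) -> int:
    """Linear position of a 4-byte sequence, without any range checking."""
    b1, b2, b3, b4 = seq
    # Bytes 1 and 3 each span 126 values (0x81-0xFE);
    # bytes 2 and 4 each span 10 values (0x30-0x39).
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

print(linear_index(b'\x81\x31\x81\x30'))  # 1260 (legal sequence)
print(linear_index(b'\x81\x30\xFF\x30'))  # 1260 (illegal, same index)
```

Without an upper-bound check on the 3rd byte, 0xFF - 0x81 = 126 carries over 
exactly one step of the 2nd byte, so both inputs land on the same index.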

----------
components: Unicode
messages: 291153
nosy: Ma Lin, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Range checking in GB18030 decoder
type: behavior
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue29990>
_______________________________________