Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

James Y Knight Mon, 27 Apr 2009 22:20:00 -0700


On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote:

No. You seem to assume that all bytes < 128 decode successfully always.

I believe this assumption is wrong, in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position
3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode.
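
For readers on Python 3, the same check can be sketched as below. This is an illustration rather than part of the original exchange: the variable names are invented here, and it assumes the iso2022_jp codec rejects this input the same way it does in the 2.x session above.

```python
# Python 3 sketch of the 2.x session above (names are illustrative only).
data = b"\x1b$B' \x1b(B"

# Every byte in the sequence is in the ASCII range (< 128).
all_below_128 = all(b < 128 for b in data)

# Attempt the decode and record whether the codec rejects the input.
try:
    data.decode("iso-2022-jp")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

print(all_below_128, decode_failed)
```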

Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly forbidden by POSIX, if I'm not mistaken...and I can't see how it would work, considering that it uses all the bytes from 0x20-0x7f, including 0x2f ("/"), to represent non-ascii characters.

Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX...
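
The point about 0x2f can be checked with a small search. The sketch below is an added illustration, not code from the thread: it scans the hiragana block (an arbitrary choice made here) and collects any character whose ISO-2022-JP encoding contains the byte for "/".

```python
# Illustrative search: which hiragana encode to a byte string containing 0x2F ("/")?
hits = []
for cp in range(0x3041, 0x3094):  # hiragana, small "a" through "n"
    ch = chr(cp)
    try:
        encoded = ch.encode("iso2022_jp")
    except UnicodeEncodeError:
        continue  # skip anything the codec cannot represent
    # The escape sequences themselves contain no 0x2F, so any hit comes
    # from the two-byte character code.
    if 0x2F in encoded:
        hits.append((ch, encoded))

for ch, encoded in hits:
    print(repr(ch), encoded)
```

Any hit is a non-ASCII character whose encoded form contains a byte that a path-handling routine would read as a directory separator, which is why such an encoding cannot serve as a non-overlapping superset of ASCII.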

I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler.
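
A minimal sketch of the restricted mapping, using the `surrogateescape` error handler found in current Python 3 releases. This is an added illustration, not code from the thread. The sample bytes and variable names are made up, and U+DC2F is used on the assumption that a plain "U+DC00 + byte" mapping would pair it with 0x2F.

```python
# Undecodable bytes in 0x80-0xFF become U+DC80-U+DCFF and round-trip exactly.
raw = b"caf\xaf"
text = raw.decode("ascii", "surrogateescape")
round_trip = text.encode("ascii", "surrogateescape")
print(ascii(text), round_trip == raw)

# A lone surrogate below U+DC80, such as U+DC2F, is rejected on encode
# instead of being turned into byte 0x2F ("/").
try:
    "\udc2f".encode("ascii", "surrogateescape")
    slash_smuggled = True
except UnicodeEncodeError:
    slash_smuggled = False
print(slash_smuggled)
```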


James
