Serhiy Storchaka added the comment:

Thank you Eryk. That is what I want. I just missed that code_page_decode() 
returns a tuple.

It seems that Windows maps undefined codes to the Unicode characters with the 
same code points if they are in the range 0x80-0x9f, and raises an error if 
they are outside of this range. But if the code starts a multibyte sequence, 
the single byte is an error even if it is in the range 0x80-0x9f (codepages 
932, 949, 950).

This could be emulated either by decoding with errors='surrogateescape' and 
postprocessing the result (replace '\udc80'-'\udc9f' with '\x80'-'\x9f' and 
handle '\udca0'-'\udcff' as errors) or by writing a custom error handler that 
does the job (but several error handlers would perhaps be needed, 
corresponding to 'strict', 'replace', 'ignore', etc). Adding a new codec is 
of course an option too.
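
A minimal sketch of the first option (surrogateescape plus postprocessing), 
emulating only 'strict' handling. The function name decode_like_windows is 
hypothetical, and the sketch does not cover the multibyte lead-byte case 
mentioned above for codepages 932, 949 and 950:

```python
import re

# Lone surrogates produced by errors='surrogateescape' for bytes 0x80-0xff.
_ESCAPED = re.compile('[\udc80-\udcff]')


def decode_like_windows(data, encoding):
    """Decode data, passing undefined bytes 0x80-0x9f through as U+0080-U+009F.

    Hypothetical helper: undefined bytes 0xa0-0xff raise UnicodeDecodeError,
    as with errors='strict'.
    """
    text = data.decode(encoding, errors='surrogateescape')

    def fix(match):
        code = ord(match.group()) - 0xDC00  # recover the original byte value
        if code <= 0x9F:
            return chr(code)
        # The byte offset is not tracked in this sketch, so the whole input
        # is reported as the failing range.
        raise UnicodeDecodeError(encoding, data, 0, len(data),
                                 'undefined byte 0x%02x' % code)

    return _ESCAPED.sub(fix, text)
```

For example, decode_like_windows(b'a\x81b', 'cp1252') returns 'a\x81b', 
while the same input fails with a plain data.decode('cp1252').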

There are a few other minor differences between Python and Windows:

cp864: On Windows 0x25 is mapped to '%' (U+0025) instead of '٪' (U+066A).
cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
cp1255: 0xCA is mapped to U+05BA instead of being undefined.

The first two differences can be handled by postprocessing; the last one 
requires changing the codec.
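
As an illustration of such postprocessing for the cp864 case, a small sketch 
(the function name is hypothetical). It assumes that 0x25 is the only byte 
Python's cp864 codec decodes to U+066A, so a plain replacement on the decoded 
text is safe:

```python
def decode_cp864_like_windows(data):
    """Decode cp864 data, mapping byte 0x25 to '%' (U+0025) as Windows does.

    Hypothetical helper: Python's codec decodes 0x25 to U+066A, so that
    character is replaced after decoding.
    """
    return data.decode('cp864').replace('\u066a', '%')
```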

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
_______________________________________