[issue28712] Non-Windows mappings for a couple of Windows code pages

Eryk Sun Wed, 16 Nov 2016 18:25:27 -0800

Eryk Sun added the comment:

The ANSI and OEM codepages are conveniently supported on a Windows system as 
the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by 
the 'replace' error handler (see the encode_code_page_flags function in 
Objects/unicodeobject.c). For other Windows codepages, while it's not as 
convenient, you can use codecs.code_page_encode. For example:


    >>> codecs.code_page_encode(1252, 'α', 'replace')
    (b'a', 1)
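
For contrast, here is a minimal sketch comparing that best-fit result with 
Python's own table-driven cp1252 codec, which has no best-fit mapping and 
substitutes "?" under 'replace'. The names text, portable and best_fit are 
illustrative only, and the hasattr guard is there because 
codecs.code_page_encode exists only on Windows builds:

```python
import codecs

text = 'α'

# Python's built-in cp1252 codec is a plain lookup table, so the
# 'replace' handler falls back to '?' for unmappable characters.
portable = text.encode('cp1252', 'replace')
print(portable)

# codecs.code_page_encode is only available on Windows, where it
# calls WideCharToMultiByte and can apply the best-fit mapping.
if hasattr(codecs, 'code_page_encode'):
    best_fit, consumed = codecs.code_page_encode(1252, text, 'replace')
    print(best_fit, consumed)
else:
    best_fit = None
    print('codecs.code_page_encode is not available on this platform')
```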

For decoding, the MB_ERR_INVALID_CHARS flag has no effect on single-byte 
codepages because they map every byte. It only affects byte sequences that 
are invalid in multibyte codepages such as 932 and 65001. Without this flag, 
invalid sequences are silently decoded as the codepage's Unicode default 
character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), 
and for UTF-8 it's U+FFFD. codecs.code_page_decode passes MB_ERR_INVALID_CHARS 
for every codepage except UTF-7 (see the decode_code_page_flags function), so 
its 'replace' error handling is entirely Python's own implementation. For 
example:

MultiByteToWideChar without MB_ERR_INVALID_CHARS:

    >>> decode(932, b'\xe05', strict=False)
    '\u30fb'

versus code_page_decode:

    >>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
    ('\ufffd5', 2)
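
The decode helper in the first example is not defined in this message. One 
way such a helper might be written with ctypes is sketched below; this is an 
assumption, not necessarily the code that produced the output above. It calls 
MultiByteToWideChar twice, once to get the required buffer size and once to 
convert, and passes MB_ERR_INVALID_CHARS only when strict is true:

```python
import sys

MB_ERR_INVALID_CHARS = 0x00000008  # flag value from the Windows SDK headers


def decode(codepage, data, strict=True):
    """Hypothetical helper: decode bytes with MultiByteToWideChar."""
    if sys.platform != 'win32':
        raise OSError('MultiByteToWideChar is only available on Windows')
    if not data:
        return ''  # the API rejects a zero-length input

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    mb2wc = kernel32.MultiByteToWideChar
    mb2wc.restype = ctypes.c_int
    mb2wc.argtypes = (wintypes.UINT, wintypes.DWORD, ctypes.c_char_p,
                      ctypes.c_int, wintypes.LPWSTR, ctypes.c_int)

    flags = MB_ERR_INVALID_CHARS if strict else 0

    # First call: ask how many wide characters the result needs.
    size = mb2wc(codepage, flags, data, len(data), None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: convert into a buffer of that size.
    buf = ctypes.create_unicode_buffer(size)
    size = mb2wc(codepage, flags, data, len(data), buf, size)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:size]
```

With strict=True an invalid sequence makes the call fail and the sketch 
raises OSError; with strict=False the codepage's default character is 
substituted silently, as described above.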

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue28712>
_______________________________________