Eryk Sun added the comment:
The ANSI and OEM codepages are conveniently supported on a Windows system as
the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by
the 'replace' error handler (see the encode_code_page_flags function in
Objects/unicodeobject.c). For other Windows codepages, while it's not as
convenient, you can use codecs.code_page_encode. For example:
>>> import codecs
>>> codecs.code_page_encode(1252, 'α', 'replace')
(b'a', 1)
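To make the best-fit behaviour easier to see, here is a minimal sketch that contrasts Python's own charmap codec with the Windows codepage codec. The helper name compare_replace is hypothetical, and the 'cp%d' lookup assumes Python ships a codec for the given codepage (true for 1252, not for every codepage). On non-Windows builds codecs.code_page_encode does not exist, so the second value is None there.

```python
import codecs

def compare_replace(code_page, text):
    """Return (python_codec_result, windows_result_or_None).

    Hypothetical helper for illustration only.
    """
    # Python's own cpNNNN codec: 'replace' substitutes b'?' for
    # characters the codepage cannot represent.
    python_result = text.encode('cp%d' % code_page, 'replace')

    # The Windows codepage codec is only present on Windows builds.
    windows_result = None
    if hasattr(codecs, 'code_page_encode'):
        windows_result, _consumed = codecs.code_page_encode(
            code_page, text, 'replace')
    return python_result, windows_result

print(compare_replace(1252, '\u03b1'))
```

On Windows the second value should match the b'a' shown above, while the first is b'?' on every platform.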
MB_ERR_INVALID_CHARS has no effect when decoding single-byte
codepages because they map every byte. It only affects decoding byte sequences
that are invalid in multibyte codepages such as 932 and 65001. Without this
flag, invalid sequences are silently decoded as the codepage's Unicode default
character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB),
and for UTF-8 it's U+FFFD. codecs.code_page_decode passes MB_ERR_INVALID_CHARS
for every codepage except UTF-7 (see the decode_code_page_flags function), so
its 'replace' error handling is implemented entirely by Python. For
example:
MultiByteToWideChar without MB_ERR_INVALID_CHARS (decode is a helper that calls
it directly; its definition is not shown):
>>> decode(932, b'\xe05', strict=False)
'\u30fb'
versus code_page_decode:
>>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
('\ufffd5', 2)
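The decode helper used in the first example is not defined in this message. A minimal sketch of what such a helper could look like, assuming it is a thin ctypes wrapper around MultiByteToWideChar, follows. The signature decode(code_page, data, strict) is inferred from the call above and is an assumption, not the original code. It only works on Windows and raises OSError elsewhere.

```python
import ctypes
import sys

MB_ERR_INVALID_CHARS = 0x00000008

def decode(code_page, data, strict=True):
    """Decode bytes via MultiByteToWideChar (assumed shape of the helper)."""
    if sys.platform != "win32":
        raise OSError("MultiByteToWideChar is only available on Windows")
    if not data:
        # The API rejects a zero-length input, so handle it here.
        return ""

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    mb2wc = kernel32.MultiByteToWideChar
    mb2wc.argtypes = (ctypes.c_uint, ctypes.c_ulong, ctypes.c_char_p,
                      ctypes.c_int, ctypes.c_wchar_p, ctypes.c_int)
    mb2wc.restype = ctypes.c_int

    flags = MB_ERR_INVALID_CHARS if strict else 0

    # First call: ask for the required buffer size in wide characters.
    size = mb2wc(code_page, flags, data, len(data), None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: perform the conversion into the buffer.
    buf = ctypes.create_unicode_buffer(size)
    size = mb2wc(code_page, flags, data, len(data), buf, size)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:size]

if sys.platform == "win32":
    # Without the flag, the invalid sequence decodes to the default character.
    print(ascii(decode(932, b"\xe05", strict=False)))
```

With strict=True the same input should fail instead of being replaced, which is the behaviour code_page_decode builds its own error handling on.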
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue28712>
_______________________________________