[issue46668] encodings: the "mbcs" alias doesn't work

Eryk Sun Mon, 07 Feb 2022 09:54:02 -0800


Eryk Sun <eryk...@gmail.com> added the comment:


> I don't think that this fallback is needed anymore. Which Windows
> code page can be used as ANSI code page which is not already 
> implemented as a Python codec?

Python has full coverage of the ANSI and OEM code pages in the standard Windows 
locales, but I don't have any experience with custom (i.e. supplemental or 
replacement) locales.

https://docs.microsoft.com/en-us/windows/win32/intl/custom-locales 

Here's a simple script to check the standard locales.

    import codecs
    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    LOCALE_ALL = 0
    LOCALE_WINDOWS = 1
    LOCALE_IDEFAULTANSICODEPAGE = 0x1004
    LOCALE_IDEFAULTCODEPAGE = 0x000B # OEM

    EnumSystemLocalesEx = kernel32.EnumSystemLocalesEx
    GetLocaleInfoEx = kernel32.GetLocaleInfoEx
    GetCPInfoExW = kernel32.GetCPInfoExW

    EnumLocalesProcEx = ctypes.WINFUNCTYPE(ctypes.c_int,
        ctypes.c_wchar_p, ctypes.c_ulong, ctypes.c_void_p)

    class CPINFOEXW(ctypes.Structure):
        _fields_ = (('MaxCharSize', ctypes.c_uint),
                    ('DefaultChar', ctypes.c_ubyte * 2),
                    ('LeadByte', ctypes.c_ubyte * 12),
                    ('UnicodeDefaultChar', ctypes.c_wchar),
                    ('CodePage', ctypes.c_uint),
                    ('CodePageName', ctypes.c_wchar * 260))

    def get_all_locale_code_pages():
        result = []
        seen = set()
        info = (ctypes.c_wchar * 100)()

        @EnumLocalesProcEx
        def callback(locale, flags, param):
            for lctype in (LOCALE_IDEFAULTANSICODEPAGE,
                           LOCALE_IDEFAULTCODEPAGE):
                if (GetLocaleInfoEx(locale, lctype, info, len(info)) and
                      info.value not in ('0', '1')):
                    cp = int(info.value)
                    if cp in seen:
                        continue
                    seen.add(cp)
                    cp_info = CPINFOEXW()
                    if not GetCPInfoExW(cp, 0, ctypes.byref(cp_info)):
                        cp_info.CodePage = cp
                        cp_info.CodePageName = str(cp)
                    result.append(cp_info)
            return True

        if not EnumSystemLocalesEx(callback, LOCALE_WINDOWS, None, None):
            raise ctypes.WinError(ctypes.get_last_error())

        result.sort(key=lambda x: x.CodePage)
        return result

    supported = []
    unsupported = []
    for cp_info in get_all_locale_code_pages():
        cp = cp_info.CodePage
        try:
            codecs.lookup(f'cp{cp}')
        except LookupError:
            unsupported.append(cp_info)
        else:
            supported.append(cp_info)

    if unsupported:
        print('Unsupported:\n')
        for cp_info in unsupported:
            print(cp_info.CodePageName)
        print('\nSupported:\n')
    else:
        print('All Supported:\n')
    for cp_info in supported:
        print(cp_info.CodePageName)


Output:

    All Supported:

    437   (OEM - United States)
    720   (Arabic - Transparent ASMO)
    737   (OEM - Greek 437G)
    775   (OEM - Baltic)
    850   (OEM - Multilingual Latin I)
    852   (OEM - Latin II)
    855   (OEM - Cyrillic)
    857   (OEM - Turkish)
    862   (OEM - Hebrew)
    866   (OEM - Russian)
    874   (ANSI/OEM - Thai)
    932   (ANSI/OEM - Japanese Shift-JIS)
    936   (ANSI/OEM - Simplified Chinese GBK)
    949   (ANSI/OEM - Korean)
    950   (ANSI/OEM - Traditional Chinese Big5)
    1250  (ANSI - Central Europe)
    1251  (ANSI - Cyrillic)
    1252  (ANSI - Latin I)
    1253  (ANSI - Greek)
    1254  (ANSI - Turkish)
    1255  (ANSI - Hebrew)
    1256  (ANSI - Arabic)
    1257  (ANSI - Baltic)
    1258  (ANSI/OEM - Viet Nam)

Some locales are Unicode-only (e.g. Hindi-India) or have no OEM code page. 
The script above skips these by ignoring a code page value of "0" or "1". 
Windows 10+ allows setting the system locale to a Unicode-only locale, in 
which case it uses UTF-8 (65001) as both the ANSI and the OEM code page.
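
As a rough, portable sketch of the same check for a single code page: the 
helper name codec_for_code_page is hypothetical, and the set of skipped 
values simply mirrors the "0" and "1" test in the script above. It relies 
only on codecs.lookup(), so it runs on any platform. In Python 3.8+, "cp65001" 
is an alias of the UTF-8 codec.

```python
import codecs

# Values reported in place of a real code page; the script above skips
# them as the strings '0' and '1'. (Assumption: nothing else is special.)
NO_CODE_PAGE = {0, 1}

def codec_for_code_page(cp):
    """Return the Python codec name for code page `cp`, or None.

    Hypothetical helper for illustration; it is not part of the stdlib.
    """
    if cp in NO_CODE_PAGE:
        return None
    try:
        return codecs.lookup(f'cp{cp}').name
    except LookupError:
        return None

for cp in (437, 1252, 65001, 0):
    print(cp, codec_for_code_page(cp))
```

A None result means either that the value is not a real code page or that 
Python has no codec registered under the "cpNNN" name.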

The OEM code page matters because the console's input and output code pages 
default to it, which is what os.device_encoding() reports, for example. The 
console's I/O code pages are used in Python by low-level os.read() and 
os.write(). Note that the console doesn't properly support UTF-8 (65001) as 
the input code page. In this case, input read from the console via ReadFile() 
or ReadConsoleA() has a null byte in place of each non-ASCII character.
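
To see which code pages are in effect, something like the following sketch 
may help. The function name console_encodings is hypothetical. The Windows 
branch assumes the standard kernel32 functions GetACP, GetOEMCP, GetConsoleCP 
and GetConsoleOutputCP; the last two return 0 if the process has no console. 
os.device_encoding() returns None for a file descriptor that is not 
connected to a terminal.

```python
import os
import sys

def console_encodings():
    """Map each standard stream name to os.device_encoding() for its fd."""
    return {name: os.device_encoding(fd)
            for name, fd in (('stdin', 0), ('stdout', 1), ('stderr', 2))}

if sys.platform == 'win32':
    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    # Process-wide ANSI and OEM code pages.
    print('ANSI:', kernel32.GetACP(), 'OEM:', kernel32.GetOEMCP())
    # Code pages of the attached console, or 0 without a console.
    print('console input:', kernel32.GetConsoleCP(),
          'console output:', kernel32.GetConsoleOutputCP())

print(console_encodings())
```

On a default setup the console values would be expected to match the OEM 
code page until something such as chcp changes them.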

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue46668>
_______________________________________