STINNER Victor <vstin...@redhat.com> added the comment:

> RE making UnixMain public, I'd rather the core runtime require a known 
> encoding, rather than trying to detect it. We should move the call into the 
> detection logic into Programs/python.c so that embedders have to opt-in to 
> detection (many embedding scenarios will prefer to do their own encoding).

Unix is a very complex beast and Python makes it worse by adding more options 
(PEP 538 and PEP 540). Py_UnixMain() works "as expected": it uses the LC_CTYPE 
locale encoding.

If you want to force the use of UTF-8, you can opt in to UTF-8 mode: for 
example, call putenv("PYTHONUTF8=1") before Py_UnixMain().
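
A minimal sketch of that opt-in, assuming a POSIX system. The helper name 
enable_utf8_mode() is hypothetical, and the call to Py_UnixMain() is only shown 
in a comment, since that entry point is what this issue is discussing and the 
snippet does not link against Python:

```c
/* Assumption: a POSIX system; this macro asks libc to declare putenv(). */
#define _XOPEN_SOURCE 700
#include <stdlib.h>

/* Hypothetical helper: request the UTF-8 mode (PEP 540) through the
   environment. It must run before Python reads its configuration.
   Returns 0 on success, non-zero on failure. */
int
enable_utf8_mode(void)
{
    /* putenv() stores the pointer itself, not a copy, so the string
       must have static storage duration. */
    static char utf8_env[] = "PYTHONUTF8=1";
    return putenv(utf8_env);
}

/* Intended use in an embedding program (not compiled here):

   int main(int argc, char **argv)
   {
       if (enable_utf8_mode() != 0) {
           return 1;
       }
       return Py_UnixMain(argc, argv);
   }
*/
```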

You cannot pass an encoding to Py_UnixMain() because the implementation of 
Python relies heavily on the LC_CTYPE locale: see the Py_DecodeLocale() and 
Py_EncodeLocale() functions. In any case, Python must use the locale encoding 
to avoid mojibake. During startup, Python must use the codec from the C 
library, mbstowcs() and wcstombs(), to be able to load its own codecs. Python 
has a few codecs implemented in C, like ASCII, UTF-8 and Latin1, but locales 
are far more diverse than that. For example, ISO-8859-15 is used for "euro" 
locale variants. Example:

$ LANG=fr_FR.iso885915@euro python3 -c 'import sys; print(sys.getfilesystemencoding())'
iso8859-15
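
To make the role of the C library concrete, here is a minimal stand-alone 
sketch of decoding a byte string with mbstowcs() under the current LC_CTYPE 
locale. It is not Py_DecodeLocale() itself, which does more work (error 
handlers, for instance); the function name decode_with_locale() is 
hypothetical.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode a NUL-terminated byte string using the codec
   of the current LC_CTYPE locale, as provided by the C library.
   Returns the number of wide characters written (terminator excluded), or
   (size_t)-1 if the bytes are invalid in the locale encoding or if the
   result, including its terminator, does not fit in the buffer. */
size_t
decode_with_locale(const char *bytes, wchar_t *out, size_t out_len)
{
    if (out_len == 0) {
        return (size_t)-1;
    }
    size_t n = mbstowcs(out, bytes, out_len);
    if (n == (size_t)-1) {
        return n;               /* invalid byte sequence for this locale */
    }
    if (n == out_len) {
        return (size_t)-1;      /* buffer filled, no room for L'\0' */
    }
    return n;
}
```

A caller would first select the user's locale with setlocale(LC_CTYPE, "") and 
then call decode_with_locale(). With the fr_FR.iso885915@euro locale from the 
example above (assuming it is installed), the byte 0xA4 should decode to the 
euro sign, without any Python codec being involved.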

Python has an ISO-8859-15 codec, but it's implemented in pure Python. Python 
uses importlib to load the codec, but how does Python decode and encode 
filenames to import Lib/encodings/iso8859_15.py? That's why 
mbstowcs()/wcstombs() and Py_DecodeLocale()/Py_EncodeLocale() come into 
play :-) Enjoy:

PyObject*
PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
{
    PyInterpreterState *interp = _PyInterpreterState_GET_UNSAFE();
    const _PyCoreConfig *config = &interp->core_config;
#if defined(__APPLE__)
    return PyUnicode_DecodeUTF8Stateful(s, size, config->filesystem_errors,
                                        NULL);
#else
    /* Bootstrap check: if the filesystem codec is implemented in Python, we
       cannot use it to encode and decode filenames before it is loaded.
       Loading the Python codec requires encoding at least its own filename.
       Use the C implementation of the locale codec until the codec registry
       is initialized and the Python codec is loaded. See initfsencoding(). */
    if (interp->fscodec_initialized) {
        return PyUnicode_Decode(s, size,
                                config->filesystem_encoding,
                                config->filesystem_errors);
    }
    else {
        return unicode_decode_locale(s, size,
                                     config->filesystem_errors, 0);
    }
#endif
}
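
The bootstrap logic of the #else branch can be modelled outside of CPython. 
The following is a simplified stand-alone sketch, not CPython code: fs_state, 
decode_bootstrap(), decode_latin1() and decode_filename() are all hypothetical 
names, and a trivial Latin1 decoder stands in for the codec that only becomes 
available later.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical signature shared by both decoders. */
typedef size_t (*decode_func)(const char *bytes, wchar_t *out, size_t out_len);

/* Hypothetical state, loosely mirroring interp->fscodec_initialized. */
typedef struct {
    int codec_initialized;   /* set once the configured codec is loaded */
    decode_func configured;  /* stand-in for the codec loaded later */
} fs_state;

/* Bootstrap decoder: relies only on the C library and the LC_CTYPE locale. */
size_t
decode_bootstrap(const char *bytes, wchar_t *out, size_t out_len)
{
    return mbstowcs(out, bytes, out_len);
}

/* Stand-in "configured" codec: Latin1 maps each byte to the same code point.
   Returns the number of characters written, or (size_t)-1 if the result
   (terminator included) does not fit. */
size_t
decode_latin1(const char *bytes, wchar_t *out, size_t out_len)
{
    size_t i = 0;
    if (out_len == 0) {
        return (size_t)-1;
    }
    while (bytes[i] != '\0') {
        if (i + 1 >= out_len) {
            return (size_t)-1;
        }
        out[i] = (wchar_t)(unsigned char)bytes[i];
        i++;
    }
    out[i] = L'\0';
    return i;
}

/* Use the configured codec once it is available; until then, fall back to
   the C library codec. */
size_t
decode_filename(const fs_state *state, const char *bytes,
                wchar_t *out, size_t out_len)
{
    if (state->codec_initialized && state->configured != NULL) {
        return state->configured(bytes, out, out_len);
    }
    return decode_bootstrap(bytes, out, out_len);
}
```

The point of the model is the ordering: every filename decoded before 
codec_initialized is set goes through the C library, which is why the locale 
codec cannot simply be replaced by an encoding name passed in by the caller.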

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue36204>
_______________________________________