Hi,

Oh, locale.getpreferredencoding(), that's a good question :-)

2017-12-08 6:02 GMT+01:00 INADA Naoki <songofaca...@gmail.com>:
> But I want to clarify more about difference/relationship between PEP
> 538 and 540.
>
> If I understand correctly:
>
> Both PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) share the
> same logic to detect the POSIX locale.
>
> When the POSIX locale is detected, locale coercion is tried first. And
> if locale coercion succeeds, UTF-8 mode is not used because the locale
> is not POSIX anymore.

No, I would like to enable the UTF-8 mode as well in this case.

In short, locale coercion and UTF-8 mode will both be enabled by the
POSIX locale.

> If locale coercion is disabled or failed, UTF-8 mode is used
> automatically, unless it is disabled explicitly.

PEP 540 is always enabled if the POSIX locale is detected. Only
PYTHONUTF8=0 or -X utf8=0 disables it in this case. Disabling locale
coercion doesn't disable PEP 540.

> UTF-8 mode is similar to C.UTF-8 or other locale coercion target
> locales. But UTF-8 mode is different from the C.UTF-8 locale in these
> ways because the actual locale is not changed:
>
> * Libraries using locale (e.g. readline) work as in the POSIX locale.
>   So UTF-8 cannot be used in such libraries.

My assumption is that very few C libraries rely on the locale encoding.
The wchar_t* type is rarely used. You may only get issues if Python
passes a UTF-8 encoded string to a C library which tries to decode it
from a locale encoding which is not UTF-8. For example, with the POSIX
locale, if the locale encoding is ASCII, you can get a decoding error if
a C library tries to decode a UTF-8 encoded string coming from Python.

But the encoding problem is not restricted to the current process. For
the "producer | consumer" model, if the producer is a Python 3.7
application using UTF-8 mode, and so encoding text written to stdout as
UTF-8, the consumer application may be unable to decode the UTF-8 data.

Here we enter the grey area of encodings. Which applications rely on the
locale encoding? Which applications always use UTF-8?
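
A minimal sketch of the "producer | consumer" failure described above,
assuming a consumer that strictly decodes its input with one fixed
encoding. The consume() helper and the sample text are hypothetical and
only illustrate the mismatch; they are not part of PEP 538 or PEP 540:

```python
# Producer side: text encoded to UTF-8, as a Python process in UTF-8
# mode would write it to stdout.
text = "caf\u00e9"
payload = text.encode("utf-8")


def consume(data: bytes, encoding: str) -> str:
    """Hypothetical consumer: strictly decode piped bytes with one encoding."""
    return data.decode(encoding)


# A consumer that assumes UTF-8 gets the original text back.
utf8_result = consume(payload, "utf-8")

# A consumer that trusts an ASCII locale encoding cannot decode the
# non-ASCII bytes and raises UnicodeDecodeError.
try:
    consume(payload, "ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Whether a real consumer fails this way depends on which encoding it
picks and on its error handler, which is exactly the unknown here.
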
Do some applications try UTF-8 first and then fall back on the locale
encoding? (OpenSSL does that on filenames for example, as does glib if I
recall correctly.)

Until we know exactly how UTF-8 is used in the "wild", I chose to make
the UTF-8 mode an opt-in option for locales other than POSIX. I expect a
few bug reports later which will help us to adjust our encodings.

> * locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8'. So
>   libraries depending on locale.getpreferredencoding() may raise
>   UnicodeErrors.

Right.

> Or locale.getpreferredencoding() returns UTF-8 in UTF-8 mode too?

Here is where PEP 538 plays very nicely with PEP 540. On platforms where
locale coercion is supported (Fedora, macOS, FreeBSD, maybe other Linux
distributions), with the POSIX locale, locale.getpreferredencoding()
will return UTF-8 and functions like mbstowcs() will use the UTF-8
encoding internally.

Currently, in the implementation of my PEP 540, I chose to modify open()
to use UTF-8 if the UTF-8 mode is used, rather than using
locale.getpreferredencoding().

Victor