Looks nice. But I would like to clarify the difference and relationship between PEP 538 and PEP 540.

If I understand correctly:

Both PEP 538 (locale coercion) and PEP 540 (UTF-8 mode) share the same
logic to detect the POSIX locale.

When the POSIX locale is detected, locale coercion is tried first. If
locale coercion succeeds, UTF-8 mode is not used because the locale is no
longer POSIX.

If locale coercion is disabled or fails, UTF-8 mode is used
automatically, unless it is explicitly disabled.

UTF-8 mode is similar to C.UTF-8 or the other locale coercion target
locales. But UTF-8 mode differs from the C.UTF-8 locale in these ways,
because the actual locale is not changed:

* Libraries using the locale (e.g. readline) work as in the POSIX locale,
  so UTF-8 cannot be used in such libraries.
* locale.getpreferredencoding() returns 'ASCII' instead of 'UTF-8', so
  libraries depending on locale.getpreferredencoding() may raise
  UnicodeErrors.

Am I correct?
Or does locale.getpreferredencoding() return UTF-8 in UTF-8 mode too?

INADA Naoki  <songofaca...@gmail.com>


On Fri, Dec 8, 2017 at 9:50 AM, Victor Stinner <victor.stin...@gmail.com> wrote:
> Hi,
>
> I made the following two changes to the PEP 540:
>
> * open() error handler remains "strict"
> * remove the "Strict UTF8 mode" which doesn't make much sense anymore
>
> I wrote the Strict UTF-8 mode when open() used surrogateescape error
> handler in the UTF-8 mode. I don't think that a Strict UTF-8 mode is
> required just to change the error handler of stdin and stdout.
> Well, read the "Passthrough undecodable bytes: surrogateescape" section of
> the PEP rationale :-)
>
>
> https://www.python.org/dev/peps/pep-0540/
>
> Victor
>
>
> PEP: 540
> Title: Add a new UTF-8 mode
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner <victor.stin...@gmail.com>
> BDFL-Delegate: INADA Naoki
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 5-January-2016
> Python-Version: 3.7
>
>
> Abstract
> ========
>
> Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
> change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
> This mode is enabled by default in the POSIX locale, but otherwise
> disabled by default.
>
> The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added to control the UTF-8 mode.
>
>
> Rationale
> =========
>
> Locale encoding and UTF-8
> -------------------------
>
> Python 3.6 uses the locale encoding for filenames, environment
> variables, standard streams, etc. The locale encoding is inherited from
> the locale; the encoding and the locale are tightly coupled.
>
> Many users inherit the ASCII encoding from the POSIX locale, aka the "C"
> locale, but are unable to change the locale for various reasons. This
> encoding is very limited in terms of Unicode support: any non-ASCII
> character is likely to cause trouble.
>
> It is not easy to get the expected locale. Locales don't have the exact
> same name on all Linux distributions, FreeBSD, macOS, etc. Some
> locales, like the recent ``C.UTF-8`` locale, are only supported by a few
> platforms. For example, an SSH connection can use a different encoding
> than the filesystem or terminal encoding of the local host.
>
> On the other hand, Python 3.6 already uses UTF-8 by default on
> macOS, Android and Windows (PEP 529) for most functions, except for
> ``open()``. UTF-8 is also the default encoding of Python scripts, XML
> and JSON file formats.
> The Go programming language uses UTF-8 for strings.
>
> When all data are stored as UTF-8 but the locale is often misconfigured,
> an obvious solution is to ignore the locale and use UTF-8.
>
> PEP 538 attempts to mitigate this problem by coercing the C locale
> to a UTF-8 based locale when one is available, but that isn't a
> universal solution. For example, CentOS 7's container images default
> to the POSIX locale, and don't include the C.UTF-8 locale, so PEP 538's
> locale coercion is ineffective.
>
>
> Passthrough undecodable bytes: surrogateescape
> ----------------------------------------------
>
> When decoding bytes from UTF-8 using the ``strict`` error handler, which
> is the default, Python 3 raises a ``UnicodeDecodeError`` on the first
> undecodable byte.
>
> Unix command line tools like ``cat`` or ``grep`` and most Python 2
> applications simply do not have this class of bugs: they don't decode
> data, but process data as a raw bytes sequence.
>
> Python 3 already has a solution to behave like Unix tools and Python 2:
> the ``surrogateescape`` error handler (:pep:`383`). It allows processing
> data "as bytes" while using Unicode in practice (undecodable bytes are
> stored as surrogate characters).
>
> The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
> and ``stdout`` since these streams are commonly associated with Unix
> command line tools.
>
> However, users have a different expectation on files. Files are expected
> to be properly encoded. Python is expected to fail early when ``open()``
> is called with the wrong options, like opening a JPEG picture in text
> mode. The ``open()`` default error handler remains ``strict`` for these
> reasons.
>
>
> No change by default for best backward compatibility
> ----------------------------------------------------
>
> While UTF-8 is perfect in most cases, sometimes the locale encoding is
> actually the best encoding.
>
> This PEP changes the behaviour for the POSIX locale since this locale
> usually gives the ASCII encoding, whereas UTF-8 is a much better choice.
> It does not change the behaviour for other locales to prevent any risk
> or regression.
>
> As users are responsible for explicitly enabling the new UTF-8 mode, they
> are responsible for any potential mojibake issues caused by this mode.
>
>
> Proposal
> ========
>
> Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and
> change ``stdin`` and ``stdout`` error handlers to ``surrogateescape``.
> This mode is enabled by default in the POSIX locale, but otherwise
> disabled by default.
>
> The new ``-X utf8`` command line option and ``PYTHONUTF8`` environment
> variable are added. The UTF-8 mode is enabled by ``-X utf8`` or
> ``PYTHONUTF8=1``.
>
> The POSIX locale enables the UTF-8 mode. In this case, the UTF-8 mode
> can be explicitly disabled by ``-X utf8=0`` or ``PYTHONUTF8=0``.
>
> For standard streams, the ``PYTHONIOENCODING`` environment variable has
> priority over the UTF-8 mode.
>
> On Windows, the ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable
> (:pep:`529`) has priority over the UTF-8 mode.
>
>
> Backward Compatibility
> ======================
>
> The only backward incompatible change is that the UTF-8 encoding is now
> used for the POSIX locale.
>
>
> Annex: Encodings And Error Handlers
> ===================================
>
> The UTF-8 mode changes the default encoding and error handler used by
> ``open()``, ``os.fsdecode()``, ``os.fsencode()``, ``sys.stdin``,
> ``sys.stdout`` and ``sys.stderr``.
>
> Encoding and error handler
> --------------------------
>
> ============================ ======================= ==========================
> Function                     Default                 UTF-8 mode or POSIX locale
> ============================ ======================= ==========================
> open()                       locale/strict           **UTF-8**/strict
> os.fsdecode(), os.fsencode() locale/surrogateescape  **UTF-8**/surrogateescape
> sys.stdin, sys.stdout        locale/strict           **UTF-8/surrogateescape**
> sys.stderr                   locale/backslashreplace **UTF-8**/backslashreplace
> ============================ ======================= ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================ ======================= ==========================
> Function                     Default                 POSIX locale
> ============================ ======================= ==========================
> open()                       locale/strict           locale/strict
> os.fsdecode(), os.fsencode() locale/surrogateescape  locale/surrogateescape
> sys.stdin, sys.stdout        locale/strict           locale/**surrogateescape**
> sys.stderr                   locale/backslashreplace locale/backslashreplace
> ============================ ======================= ==========================
>
> Encoding and error handler on Windows
> -------------------------------------
>
> On Windows, the encodings and error handlers are different:
>
> ============================ ======================= ========================== ==========================
> Function                     Default                 Legacy Windows FS encoding UTF-8 mode
> ============================ ======================= ========================== ==========================
> open()                       mbcs/strict             mbcs/strict                **UTF-8**/strict
> os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**           UTF-8/surrogatepass
> sys.stdin, sys.stdout        UTF-8/surrogateescape   UTF-8/surrogateescape      UTF-8/surrogateescape
> sys.stderr                   UTF-8/backslashreplace  UTF-8/backslashreplace     UTF-8/backslashreplace
> ============================ ======================= ========================== ==========================
>
> By comparison, Python 3.6 uses:
>
> ============================ ======================= ==========================
> Function                     Default                 Legacy Windows FS encoding
> ============================ ======================= ==========================
> open()                       mbcs/strict             mbcs/strict
> os.fsdecode(), os.fsencode() UTF-8/surrogatepass     **mbcs/replace**
> sys.stdin, sys.stdout        UTF-8/surrogateescape   UTF-8/surrogateescape
> sys.stderr                   UTF-8/backslashreplace  UTF-8/backslashreplace
> ============================ ======================= ==========================
>
> The "Legacy Windows FS encoding" is enabled by the
> ``PYTHONLEGACYWINDOWSFSENCODING`` environment variable.
>
> If stdin and/or stdout is redirected to a pipe, ``sys.stdin`` and/or
> ``sys.stdout`` use ``mbcs`` encoding by default rather than UTF-8. But
> in the UTF-8 mode, ``sys.stdin`` and ``sys.stdout`` always use the UTF-8
> encoding.
>
> .. note::
>    There is no POSIX locale on Windows. The ANSI code page is used as the
>    locale encoding, and this code page never uses the ASCII encoding.
>
>
> Annex: Differences between PEP 538 and PEP 540
> ==============================================
>
> PEP 538's locale coercion is only effective if a suitable UTF-8
> based locale is available as a coercion target. PEP 540's
> UTF-8 mode can be enabled even for operating systems that don't
> provide a suitable platform locale (such as CentOS 7).
>
> PEP 538 only changes the interpreter's behaviour for the C locale. While the
> new UTF-8 mode of this PEP is only enabled by default in the C locale, it can
> also be enabled manually for any other locale.
>
> PEP 538 is implemented with ``setlocale(LC_CTYPE, "<coercion target>")`` and
> ``setenv("LC_CTYPE", "<coercion target>")``, so any non-Python code running
> in the process and any subprocesses that inherit the environment are impacted
> by the change.
> PEP 540 is implemented in Python internals and ignores the
> locale: non-Python code running in the same process is not aware of the
> "Python UTF-8 mode". The benefit of the PEP 538 approach is that it helps
> ensure that encoding handling in binary extension modules and subprocesses
> is consistent with CPython's encoding handling. The upside of the PEP 540
> approach is that it allows an embedding application to change the
> interpreter's behaviour without having to change the process global
> locale settings.
>
>
> Links
> =====
>
> * `bpo-29240: Implementation of the PEP 540: Add a new UTF-8 mode
>   <http://bugs.python.org/issue29240>`_
> * `PEP 538 <https://www.python.org/dev/peps/pep-0538/>`_:
>   "Coercing the legacy C locale to C.UTF-8"
> * `PEP 529 <https://www.python.org/dev/peps/pep-0529/>`_:
>   "Change Windows filesystem encoding to UTF-8"
> * `PEP 528 <https://www.python.org/dev/peps/pep-0528/>`_:
>   "Change Windows console encoding to UTF-8"
> * `PEP 383 <https://www.python.org/dev/peps/pep-0383/>`_:
>   "Non-decodable Bytes in System Character Interfaces"
>
>
> Post History
> ============
>
> * 2017-12: `[Python-Dev] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-dev/2017-December/151054.html>`_
> * 2017-04: `[Python-Dev] Proposed BDFL Delegate update for PEPs 538 &
>   540 (assuming UTF-8 for *nix system boundaries)
>   <https://mail.python.org/pipermail/python-dev/2017-April/147795.html>`_
> * 2017-01: `[Python-ideas] PEP 540: Add a new UTF-8 mode
>   <https://mail.python.org/pipermail/python-ideas/2017-January/044089.html>`_
> * 2017-01: `bpo-28180: Implementation of the PEP 538: coerce C locale to
>   C.utf-8 (msg284764) <https://bugs.python.org/issue28180#msg284764>`_
> * 2016-08-17: `bpo-27781: Change sys.getfilesystemencoding() on Windows
>   to UTF-8 (msg272916) <https://bugs.python.org/issue27781#msg272916>`_
>   -- Victor proposed ``-X utf8`` for the :pep:`529` (Change Windows
>   filesystem encoding to UTF-8)
>
>
> Copyright
> =========
>
> This document has been placed in the public domain.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/songofacandy%40gmail.com
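
The question at the top of this message (what locale.getpreferredencoding() reports in UTF-8 mode) can also be checked empirically. The following is only a minimal sketch, not part of the PEP: it assumes an interpreter that implements PEP 540, and the names PROBE and probe are made up for illustration. sys.flags.utf8_mode is read through getattr() because the PEP text quoted above does not specify that attribute. No expected output is shown, since the result depends on the interpreter version and the environment.

```python
import subprocess
import sys

# One-liner executed in a child interpreter. It prints five space-separated
# fields: the UTF-8 mode flag (None if the attribute is missing), the
# preferred locale encoding, the filesystem encoding, and the encoding and
# error handler of sys.stdout.
PROBE = (
    "import locale, sys; "
    "print(getattr(sys.flags, 'utf8_mode', None), "
    "locale.getpreferredencoding(), "
    "sys.getfilesystemencoding(), "
    "sys.stdout.encoding, sys.stdout.errors)"
)


def probe(*interpreter_args, env=None):
    """Run PROBE in a fresh interpreter and return the printed fields."""
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", PROBE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        env=env,
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    labels = (
        "utf8_mode",
        "getpreferredencoding",
        "filesystemencoding",
        "stdout.encoding",
        "stdout.errors",
    )
    # Compare the interpreter defaults with an explicit "-X utf8".
    for title, args in (("default", ()), ("-X utf8", ("-X", "utf8"))):
        print(title, dict(zip(labels, probe(*args))))
```

Passing a modified environment through the env argument (for example a copy of os.environ with LC_ALL set to C) would show the POSIX locale case in the same way.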