[Python-ideas] Re: Make UTF-8 mode more accessible for Windows users.

Inada Naoki Thu, 28 Jan 2021 16:28:49 -0800

On Fri, Jan 29, 2021 at 4:00 AM Christopher Barker <[email protected]> wrote:
>
> The "real" solution is to change the defaults not to use the system encoding 
> at all -- which, of course, we are moving towards with PEP 597. So first a 
> plug to do that as fast as possible! I myself would love to see PEP 597 
> implemented tomorrow -- for all supported versions of Python.
>


Note that PEP 597 doesn't change the default encoding. It only adds an
option to emit a warning when the default encoding is used.
I think changing the default itself might take about 10 years.

> However, the real trick here is that Python is a programming 
> language/library/runtime -- not an application. So the folks starting up the 
> interpreter are very often NOT the same as the folks writing the code.
>
> And this is why this is the issue it is -- folks write code on *nix systems, 
> or maybe Windows with utf-8 as a system encoding, or only test with ASCII 
> data, or ...  -- then someone else actually runs the code, on Windows, and it 
> doesn't work. Even if the person is technically writing the code, they may 
> have copy and pasted it or who knows what? Think about it -- of all the 
> Python code you run (libraries, etc) -- how much of it did you write yourself?
>
> (I myself have been highly negligent with my teaching materials in this 
> regard --  so have personally unleashed dozens of folks writing buggy code 
> on the world.)

You are right. Much of the code people run was written by someone
else, and it can raise UnicodeDecodeError on Windows.
UTF-8 mode rescues such code.

>
> Anyway -- I'm afraid any combination of start-up flags, environment 
> variables, etc. will not be enough -- is there a way to enable UTF-8 mode in 
> the code, e.g. with a __future__ import?
> This may be impossible, as UTF-8 mode is an interpreter global setting, and 
> it could get very messy if a __future__ import in one library changes the 
> behavior of all the other code -- but maybe there's some way to accomplish 
> something similar?
>
> Could monkey patch open() for that module, but would there be any way to have 
> it work, on a module basis, for all other uses of TextIOWrapper?

UTF-8 mode is used to decode command-line arguments and environment
variables on Unix, so for now it can be enabled only at startup.
This restriction comes from Unix, so I think we could add something
like `sys._enable_utf8_mode()` on Windows only, if it is really needed.
But then code that calls `sys._enable_utf8_mode()` would be
Windows-only, which doesn't make sense.
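
To illustrate the startup-only restriction, here is a minimal sketch
that launches a child interpreter with UTF-8 mode enabled, first with
`-X utf8` and then with `PYTHONUTF8=1`, and reads back
`sys.flags.utf8_mode`. The helper name `utf8_mode_in_child` is made up
for this example, and it assumes Python 3.7 or later.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(flags=(), extra_env=None):
    """Start a child interpreter and return its sys.flags.utf8_mode."""
    env = os.environ.copy()
    env.pop("PYTHONUTF8", None)  # ignore any value inherited from the parent
    env.update(extra_env or {})
    cmd = [sys.executable, *flags, "-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


# Both switches are read while the child interpreter starts up.
via_option = utf8_mode_in_child(flags=["-X", "utf8"])
via_env = utf8_mode_in_child(extra_env={"PYTHONUTF8": "1"})
```

In both cases the setting is passed to a new process. Nothing here
changes the mode of the interpreter that is already running.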

Another way is to add a runtime option that changes only the default
text encoding (e.g. `io.set_default_encoding("utf-8")`).
This option is worth considering. If we put it at the top of a script
or notebook, files are opened as UTF-8 on all platforms.

On the other hand, it adds yet another "xxx encoding" term to
Python. Python already has too many "xxx encoding"s, and they confuse
users. So I am cautious about adding another encoding option, and I am
focusing on UTF-8 mode for now.

>
> Maybe one work around would be for the __future__ import (Or something) to 
> set the mode, and then trigger warnings for all uses of TextIOWrapper that 
> don't use utf-8 -- that is, turn on PEP 597
>
> So you'd use one library that had the __future__ import, and it wouldn't 
> break any other code,  but it would turn on Warnings.
>

Please don't discuss PEP 597 in this thread. Let's focus on UTF-8 mode.
They are different approaches and they are not mutually exclusive.

* UTF-8 mode helps users who see UnicodeDecodeError while running `pip install`.
* PEP 597 helps developers notice code like `open("README.md").read()` in `setup.py`.
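
As a minimal sketch of the second point: `open("README.md").read()`
decodes with the locale encoding, which is often not UTF-8 on Windows,
while passing `encoding="utf-8"` explicitly behaves the same everywhere.
The file contents and the helper name `read_utf8` are invented for this
example.

```python
import tempfile
from pathlib import Path


def read_utf8(path):
    """Read a text file as UTF-8, independent of the locale encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    # Non-ASCII content is what exposes a wrong default encoding.
    readme.write_text("caf\u00e9 \u2713\n", encoding="utf-8")
    text = read_utf8(readme)
```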


> Anyway, this is a very hard problem, but what I'm trying to get at is that we 
> don't want the exact same code to run differently depending on what 
> environment it's running in. Currently, it depends on the system encoding, 
> we'd just be switching to it depending on whether UTF-8 mode is turned on, 
> which is better, I suppose (e.g. Jupyter could choose to turn UTF-8 mode on 
> by default), but would still have the same fundamental problem.
>
> Imagine someone runs some code in Jupyter, and it's fine, and then they run 
> it in plain Python, on the same machine, and it breaks -- ouch!
>

You are right. UTF-8 mode must be accessible both for Jupyter on conda
Python and for Python installed by the official installer.
If UTF-8 mode is accessible enough, users can fix this by enabling it.


> BTW: is there a way at runtime to check for UTF8 mode? Then at least I could 
> raise a warning in my code. Or maybe simply check if 
> locale.getpreferredencoding() returns utf-8, and raise a warning if not.

There is `sys.flags.utf8_mode`. But most Unix users don't use UTF-8
mode, because their locale encoding is already UTF-8.
So checking `locale.getpreferredencoding(False)` is better.
But note that `locale.getpreferredencoding(False)` may return "utf8",
"utf-8", "utf_8", "UTF-8"...
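
One way to cope with the different spellings, as a rough sketch, is to
let `codecs.lookup()` normalize the name before comparing. The helper
name `is_utf8` is only for illustration.

```python
import codecs
import locale


def is_utf8(encoding_name):
    """Return True if encoding_name is any spelling of UTF-8."""
    try:
        # codecs.lookup() maps aliases such as "utf8" or "UTF-8" to "utf-8".
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False  # unknown encoding name


locale_is_utf8 = is_utf8(locale.getpreferredencoding(False))
```

A program could then warn when `locale_is_utf8` is false, whether or
not UTF-8 mode itself is enabled.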

> That wouldn't be hard to do, but it might be worth having a small utility 
> that does it in a __future__ import:
>
> from __future__ import warn_if_not_utf8

It seems you are misusing the __future__ import. __future__ imports are
for the compiler and parser, not for runtime behavior.
And I don't think we should add `warn_if_not_utf8()` for now.

>>
>> Is it possible to enable UTF-8 mode in a configuration file like 
>> `pyvenv.cfg`?
>
> I can't see how that's any more powerful/flexible than an environment 
> variable.
>

An environment variable is powerful and flexible for power users, but
not for beginners.
Imagine users who launch Jupyter from the Start menu.

* The command-line option `-Xutf8` and `set PYTHONUTF8=1` are not accessible.
* A user environment variable is not accessible either, and it may affect
  other Python installations.


>> Is it possible to make it easier to configure?
>>
>> * Put a checkbox in the installer?
>> * Provide a small tool to allow configuration after installation?
>>   * python3 -m utf8mode enable|disable?
>>     * Accessible only for CLI user
>>       * Add "Enable UTF-8 mode" and "Disable UTF-8 mode" to Start menu?
>
>
> This is still going to have the same fundamental problems of the same code 
> running differently on different machines or even the same machine in 
> different environments, installs -- someone upgrades and forgets to check 
> that box again, etc ....
>

There are pros and cons.

If we use a user-wide (or system-wide) setting such as `PYTHONUTF8` in
the user environment variables, all Python environments use UTF-8 mode
consistently.
But it will break legacy applications running on old Python environments.
If we have a per-environment option, it is easy to recommend that users
enable UTF-8 mode.

> Maybe this would be a good thing to do once there are Warnings in place?
>

Do you mean that a program that runs only in UTF-8 mode should warn
when UTF-8 mode is not enabled? e.g.

```
import sys

if sys.platform == "win32" and not sys.flags.utf8_mode:
    sys.exit("This program runs only in UTF-8 mode. Please enable UTF-8 mode.")
```

Then I don't like it... A Windows-only API to enable UTF-8 mode at
runtime seems better.

```
import sys

if sys.platform == "win32":
    sys._win32_enable_utf8mode()  # proposed API; it does not exist today
```

Regards,

-- 
Inada Naoki  <[email protected]>
