[Python-ideas] Re: Changing the default text encoding of pathlib

Paul Moore Mon, 25 Jan 2021 12:46:47 -0800

On Mon, 25 Jan 2021 at 20:02, Christopher Barker <python...@gmail.com> wrote:
> using a system setting as a default is a really bad idea in this day of 
> interconnected computers.


I'd mildly dispute this. There are (significant) downsides with the
default behaviour being system-dependent, yes, but there are *also*
disadvantages in having Python not behave consistently with other
tools/programs on the same system.

However, on POSIX, things are generally consistent, and *already*
default to UTF-8. So the proposal is mostly going to affect Windows.
And on Windows, there's not much consistency even on a single machine
at the moment. Between OEM and ANSI codepages, and other tools that
default to UTF-8 "because that's the future", there's not much
platform consistency for Python to conform to anyway...
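
To see what "the default" means on a given machine, something like the following can be run. It is only a sketch: the helper name `describe_encodings` is made up for illustration, and the values it reports will differ between systems (and between a console and a redirected stream on Windows).

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python picks by default on this machine."""
    return {
        # The value discussed in this thread as TextIOWrapper's default.
        "preferred": locale.getpreferredencoding(False),
        # Used for file names, not for file contents.
        "filesystem": sys.getfilesystemencoding(),
        # May be None if stdout has been replaced by an object without one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name:>10}: {value}")
```

On a typical POSIX box I'd expect all three to report UTF-8; on Windows they may well not agree with each other.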

> But back to PEP 597, and how to get there:
>
> 1) We need to start with a consensus about where we want Python to be in N 
> versions. That is not specifically laid out in the PEP but it does imply that 
> in the sometime-long-in-the-future:
>
> - TextIOWrapper will have utf-8 as the default, rather than 
> `locale.getpreferredencoding(False)`
> this behaviour will then be inherited by:
> - `open()` without a binary flag in the mode
> - `Path.read_text`
> (and any other utility functions that use TextIOWrapper)
>
> - there will be a string that can be passed to encoding that will indicate 
> that the system default should be used.
>
> Forgive me if there is already a consensus on this -- but this discussion has 
> brought up some thoughts.

There's a fundamental assumption here that I think needs to be made
explicit. Which is that we're assuming that whatever N happens to be,
we anticipate that `locale.getpreferredencoding(False)` will still be
something other than UTF-8. That's *already* false on most POSIX
systems, and TBH I get the impression that Microsoft is pushing quite
hard to move Windows 10 to a UTF-8 by default position (although
"fast" in Microsoft terms may still be slow to the rest of us ;-))

So I think that the real question here is "do we want to move Python
to UTF-8-by-default faster than the OS vendors are going?" And I think
that the answer to that is much less obvious. It probably also depends
heavily on your locale - I doubt it's an accident that Inada-san¹ is
proposing this, and he's from Japan :-) Personally, as an English
speaker based in the UK, I'll be happy when UTF-8 is the default
everywhere, but I can live with the status quo until that happens. But
I'm not the main target for this change.

> 1) As TextIOWrapper is an "implementation detail" for most Python developers, 
> maybe it shouldn't have a default encoding at all, and leave the default 
> implementation(s) up to the helper functions, like open() and 
> Path.read_text() -- that would mean changes in more places, but would allow 
> different utility functions to make different choices.

*shrug*. That sounds plausible, but it's a backward compatibility
break that doesn't offer any significant benefits, so I suspect it's
not worth doing in practice.

> 2) Inada proposed an open_text() function be introduced as a stepping stone, 
> with the new behaviour. This led to one person asking if that would imply a 
> open_binary() function as well. An answer to that was no -- as no one is 
> suggesting any changes to open()'s behavior for binary files.
> However, I kind of like the idea. We now have two (at least) different file 
> objects potentially returned by open(): TextIOWrapper, and 
> BufferedReader/Writer. And the TextIOWrapper has some pretty different 
> behavior. I *think* that in virtually all cases, when the code is written, 
> the author knows whether they want a binary or text file, so it may make 
> sense to have two different open() functions, rather than having the Type 
> returned be a function of what mode flags are passed.
>
> This would make it easier for people (and tools) to reason about the code 
> with static analysis:
>
> e.g.:
>
> open_text().read() would return a string
> open_binary().read() would return bytes

These are good arguments for having explicit open_text and open_binary
functions. I don't *like* the idea, because they feel unnecessarily
verbose to me, but I can accept that this might just be because I'm
used to open().

I do think that having open_text, but *not* having open_binary, would
be a bit confusing. Particularly as pathlib has read_text and
read_bytes, so it would be inconsistent as well.
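
For concreteness, here is roughly what such a pair could look like if written today as thin wrappers around open(). This is purely a sketch: `open_text` and `open_binary` are hypothetical and not part of the standard library, and the UTF-8 default and the mode checks are my assumptions, not anything that has been agreed.

```python
import os
import tempfile
from typing import BinaryIO, TextIO


# Hypothetical helpers: neither function exists in the standard library.
def open_text(path, mode: str = "r", *, encoding: str = "utf-8", **kwargs) -> TextIO:
    """Open a file in text mode, so read() returns str."""
    if "b" in mode:
        raise ValueError("open_text() does not accept binary modes")
    return open(path, mode, encoding=encoding, **kwargs)


def open_binary(path, mode: str = "rb", **kwargs) -> BinaryIO:
    """Open a file in binary mode, so read() returns bytes."""
    if "b" not in mode:
        mode += "b"
    return open(path, mode, **kwargs)


def roundtrip_demo():
    """Write text once, then read it back as str and as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        with open_text(path, "w") as f:
            f.write("naïve")
        with open_text(path) as f:
            text = f.read()
        with open_binary(path) as f:
            raw = f.read()
    return text, raw
```

The return annotations are the part that helps static analysis: the result type no longer depends on the contents of a mode string.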

> This would also make the path to a future with different defaults smoother -- 
> plain "open" gets deprecated -- any new code uses one of the open_* 
> functions, and that new code will never need to be changed again.
>
> Back in the day, a single open() function made more sense. After all, the 
> only difference in the result for binary mode was that linefeed translation 
> was turned off (and the C legacy of course). In fact, this did lead to 
> errors, when folks accidentally left off the 'b', and tested only on *nix 
> systems. That, at least, is less of an issue now; as the text and binary 
> objects are more different, you are far more likely to get errors right away 
> -- but still at run time -- static analysis is still tricky.

This, on the other hand, I'm unequivocally against. The sheer quantity
of breakage that would be caused by deprecating open() makes this a
complete non-starter. Even if we only "deprecate in documentation",
we'd be invalidating huge amounts of advice, books and training
materials.

> On to:
>
>> Path.open() was added in Python 3.4. Path.read_text() and
>> Path.write_text() were added in Python 3.5.
>> Their history is shorter than built-in open(). Changing their default
>> encoding should be easier than built-in open and TextIOWrapper.
>> New default encodings are:
>>
>> * read_text() default encoding is "utf-8-sig"
>> * write_text() default encoding is "utf-8"
>> * open() default encoding is "utf-8-sig" when mode is "r" or None,
>> "utf-8" otherwise.
>>
>> What do you think of this idea?
>
> +1 there is a lot less legacy with Path -- we can move faster. And I honestly 
> still wonder if making utf-8 the default will cause or fix more bugs :-)

But having open(filename) do something different than
Path(filename).open() seems like it's asking for trouble. It would be
a source of a lot of unexpected bugs for people migrating from
filenames as strings to pathlib, and the *last* thing you want during
a migration is having to track down unexpected behavioural differences
you hadn't planned for.
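
As a small illustration of the kind of difference I mean, this sketch reads the same file with the two encodings named in the proposal. The helper name `bom_demo` and the file contents are made up for the example; the file starts with a UTF-8 BOM, as some editors write.

```python
import tempfile
from pathlib import Path


def bom_demo():
    """Read one BOM-prefixed file as "utf-8" and as "utf-8-sig"."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "bom.txt"
        # UTF-8 byte order mark followed by plain ASCII text.
        path.write_bytes(b"\xef\xbb\xbfhello")
        as_utf8 = path.read_text(encoding="utf-8")
        as_utf8_sig = path.read_text(encoding="utf-8-sig")
    return as_utf8, as_utf8_sig


if __name__ == "__main__":
    plain, sig = bom_demo()
    print(repr(plain))  # the BOM survives as a leading U+FEFF character
    print(repr(sig))    # the BOM is stripped
```

If the two ways of opening a file used different defaults, code comparing the first line against a known header would pass with one and fail with the other.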

> A thought on that -- there are currently two kinds of code "in the wild":
>  (A) code that uses the default, when they really want utf-8 -- currently a 
> bug, won't be a bug in the future.
>  (B) code that uses the default when it really does want the system encoding. 
> -- currently correct, will become a bug in the future
>
> It's anyone's guess which of these is more common, but one thing to consider 
> is that (A) is a hidden bug that might reveal itself in the hands of end 
> users who knows when in the future. Whereas (B) will be a bug that is likely 
> to reveal itself fairly quickly (though perhaps also in the (confused) hands 
> of end users as well)

There's also (C) code that uses the default, where that default is
already UTF-8. Which is probably most non-Windows systems. Those have
no bug, and this change will make no difference to them.

Also, (A) is "currently a bug, won't be a bug when the system encoding
switches to UTF-8", whereas (B) is "currently correct, will remain
correct when the system default becomes UTF-8". So switching Python's
default can be seen as:

(A) removes an existing bug a bit sooner.
(B) introduces a bug which will go away again when the system switches
to UTF-8 or the user changes their code.
(C) makes no difference.
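
To make (A) and (B) a bit more concrete, here is a sketch that simulates both mismatches with bytes rather than real files. The function name `simulate_cases` is invented for the example, and cp1252 is just an assumed stand-in for "the system encoding" on a Western European Windows machine.

```python
def simulate_cases(legacy: str = "cp1252"):
    """Decode the same text under matching and mismatched encodings."""
    text = "café"
    utf8_bytes = text.encode("utf-8")
    legacy_bytes = text.encode(legacy)

    results = {}
    # Case (A): the data is UTF-8 but is decoded with the legacy default.
    # For this input nothing raises; the text is silently corrupted.
    results["A_decoded"] = utf8_bytes.decode(legacy)
    # Case (B) after a switch: legacy data decoded as UTF-8.
    try:
        legacy_bytes.decode("utf-8")
        results["B_raises"] = False
    except UnicodeDecodeError:
        results["B_raises"] = True
    # Stating the encoding explicitly is correct under either default.
    results["explicit"] = utf8_bytes.decode("utf-8")
    return results


if __name__ == "__main__":
    for key, value in simulate_cases().items():
        print(f"{key}: {value!r}")
```

For this particular input, (A) corrupts the text without any error, while (B) fails immediately with an exception, which fits the point about (B) tending to reveal itself quickly. Other inputs and other code pages can behave differently.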

Frankly, I don't think there's a good answer here, and there will
likely be as many opinions as there are participants in the
discussion.

Paul

¹ I'm not 100% clear on what the polite form of address is for
Japanese names, please let me know if I should be using a different
form :-)
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VKDWSFDU4WTP3BTPO3LQKVQQDKGOPWDU/
