[Python-ideas] Re: Changing the default text encoding of pathlib

Christopher Barker Mon, 25 Jan 2021 12:04:26 -0800

On Sun, Jan 24, 2021 at 6:33 PM Inada Naoki <songofaca...@gmail.com> wrote:


> My previous thread is hijacked about "auto guessing" idea,


yes -- I'm a bit confused by that -- are folks advocating for making some
sort of encoding detection the default? or available as an option in the
stdlib? -- in any case, I think that could be an independent proposal.

First: I really want to see this get pushed forward and get done, one way
or another -- using a system setting as a default is a really bad idea in
this day of interconnected computers.

But back to PEP 597, and how to get there:

1) We need to start with a consensus about where we want Python to be in N
versions. That is not specifically laid out in the PEP, but it does imply
that at some point in the distant future:

- TextIOWrapper will have utf-8 as the default, rather than
`locale.getpreferredencoding(False)`. This behaviour will then be
inherited by:
  - `open()` without a binary flag in the mode
  - `Path.read_text`
  - (and any other utility functions that use TextIOWrapper)
- there will be a string that can be passed to encoding that will indicate
that the system default should be used.
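To make the difference between those defaults concrete, here is a minimal
sketch of what code can already do today: look up the encoding that
TextIOWrapper currently falls back to, and sidestep it by passing utf-8
explicitly. The temporary file and the variable names are only there for
illustration.

```python
import locale
import os
import tempfile

# The encoding open() uses today when no encoding argument is given.
# Its value depends on the machine this runs on.
system_default = locale.getpreferredencoding(False)

text = "naïve café"

# Create a scratch file so the example cleans up after itself.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Passing the encoding explicitly gives the same bytes on every system.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()
    # Binary mode shows the bytes that actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(system_default, round_trip == text, raw)
```

Only `system_default` varies between machines here; the explicit utf-8
round trip does not.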

Forgive me if there is already a consensus on this -- but this discussion
has brought up some thoughts.

1) As TextIOWrapper is an "implementation detail" for most Python
developers, maybe it shouldn't have a default encoding at all, and leave
the default implementation(s) up to the helper functions, like open() and
Path.read_text() -- that would mean changes in more places, but would allow
different utility functions to make different choices.

2) Inada proposed an open_text() function be introduced as a stepping
stone, with the new behaviour. This led to one person asking if that would
imply an open_binary() function as well. An answer to that was no -- as no
one is suggesting any changes to open()'s behavior for binary files.
However, I kind of like the idea. We now have two (at least) different file
objects potentially returned by open(): TextIOWrapper, and
BufferedReader/Writer. And the TextIOWrapper has some pretty different
behavior. I *think* that in virtually all cases, when the code is written,
the author knows whether they want a binary or text file, so it may make
sense to have two different open() functions, rather than having the type
returned be a function of what mode flags are passed.

This would make it easier for people (and tools) to reason about the code
with static analysis:

e.g.:

open_text().read() would return a string
open_binary().read() would return bytes

This would also make the path to a future with different defaults smoother
-- plain "open" gets deprecated -- any new code uses one of the open_*
functions, and that new code will never need to be changed again.
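A rough sketch of what such a pair might look like, written as thin
wrappers around the existing open(). Neither function exists in the stdlib;
the names come from this thread, and the signatures, the utf-8 default, and
the mode handling are my assumptions, not a worked-out proposal.

```python
import os
import tempfile
from typing import BinaryIO, TextIO


def open_text(path, mode="r", encoding="utf-8", **kwargs) -> TextIO:
    """Hypothetical helper: always returns a text-mode file object."""
    if "b" in mode:
        raise ValueError("open_text() does not accept binary modes")
    return open(path, mode, encoding=encoding, **kwargs)


def open_binary(path, mode="rb", **kwargs) -> BinaryIO:
    """Hypothetical helper: always returns a binary-mode file object."""
    if "b" not in mode:
        mode += "b"  # "r" -> "rb", "w" -> "wb", "r+" -> "r+b"
    return open(path, mode, **kwargs)


# Usage: the return type follows from the function name, not the mode string.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open_text(path, "w") as f:
        f.write("héllo")
    with open_text(path) as f:
        as_text = f.read()    # str
    with open_binary(path) as f:
        as_bytes = f.read()   # bytes
finally:
    os.remove(path)
```

With this shape a type checker only needs the annotations on the two
functions, rather than having to inspect the value of the mode argument.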

Back in the day, a single open() function made more sense. After all, the
only difference in the result for binary mode was that linefeed translation
was turned off (and the C legacy of course). In fact, this did lead to
errors, when folks accidentally left off the 'b', and tested only on *nix
systems. That, at least, is less of an issue now; as the text and binary
objects are more different, you are far more likely to get errors right
away -- but still at run time -- static analysis is still tricky.


On to:

> Path.open() was added in Python 3.4. Path.read_text() and
> Path.write_text() was added in Python 3.5.
> Their history is shorter than built-in open(). Changing its default
> encoding should be easier than built-in open and TextIOWrapper.
> New default encodings are:
>
> * read_text() default encoding is "utf-8-sig"
> * write_text() default encoding is "utf-8"
> * open() default encoding is "utf-8-sig" when mode is "r" or None,
> "utf-8" otherwise.
>
> How do you think this idea?

+1 there is a lot less legacy with Path -- we can move faster. And I
honestly still wonder if making utf-8 the default will cause or fix more
bugs :-)
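For anyone wondering why the quoted proposal uses "utf-8-sig" for reading
but plain "utf-8" for writing, here is a small illustration using the
codecs directly on bytes, so it does not depend on any file or locale. The
sample string is arbitrary.

```python
import codecs

payload = "hello"
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")
without_bom = payload.encode("utf-8")

# Reading: "utf-8-sig" accepts input with or without a BOM and drops it.
read_bom = with_bom.decode("utf-8-sig")
read_plain = without_bom.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
read_bom_as_utf8 = with_bom.decode("utf-8")

# Writing: "utf-8-sig" prepends a BOM, plain "utf-8" does not.
written_sig = payload.encode("utf-8-sig")
written_utf8 = payload.encode("utf-8")

print(repr(read_bom), repr(read_plain), repr(read_bom_as_utf8))
print(written_sig, written_utf8)
```

So "utf-8-sig" is the tolerant choice on input, while plain "utf-8" on
output avoids adding a BOM to files that did not ask for one.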

A thought on that -- there are currently two kinds of code "in the wild":
 (A) code that uses the default, when they really want utf-8 -- currently a
bug, won't be a bug in the future.
 (B) code that uses the default when it really does want the system
encoding. -- currently correct, will become a bug in the future

It's anyone's guess which of these is more common, but one thing to
consider is that (A) is a hidden bug that might reveal itself in the hands
of end users who knows when in the future. Whereas (B) will be a bug that
is likely to reveal itself fairly quickly (though perhaps also in the
(confused) hands of end users)
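One way to see why the two cases tend to surface differently, using cp1252
as a stand-in for "the system encoding" (an assumption for the example;
the real value depends on the machine). This is only one scenario: pure
ASCII data shows neither problem, and not every legacy encoding fails this
cleanly.

```python
text = "café"
utf8_bytes = text.encode("utf-8")
legacy_bytes = text.encode("cp1252")

# (A) utf-8 data read with a legacy system default: no exception at all,
# just silently wrong text that may go unnoticed for a long time.
mojibake = utf8_bytes.decode("cp1252")

# (B) legacy data read with a utf-8 default: the stray high byte is not
# valid utf-8, so the failure is immediate and visible.
try:
    legacy_bytes.decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True

print(repr(mojibake), failed_loudly)
```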

-Chris B

-- 
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HTCRMBCTMFTX53YMRQED2WRYI23YUO5I/
Code of Conduct: http://python.org/psf/codeofconduct/
