On 16Aug2016 1650, Victor Stinner wrote:
2016-08-17 1:27 GMT+02:00 Steve Dower <steve.do...@python.org>:
import os
import subprocess

filenameb = os.listdir(b'.')[0]
# Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
# what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
print("filename bytes: %a" % filenameb)

# 'script' is a Python 2 helper (defined elsewhere) that reads the filename
# from stdin and writes the file's content to stdout
proc = subprocess.Popen(['py', '-2', script],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
stdout = proc.communicate(filenameb)[0]
print("File content: %a" % stdout)
If you are defining the encoding as 'mbcs', then you need to check that
sys.getfilesystemencoding() == 'mbcs', and if it doesn't match, reencode.
Sorry, I don't understand. What do you mean by "defining an encoding"?
It's not possible to modify sys.getfilesystemencoding() in Python.
What does "reencode" mean? I'm lost.
You are transferring text between two applications without specifying
what the encoding is. sys.getfilesystemencoding() does not apply to
proc.communicate() - you can use your choice of encoding for
communicating between two processes.
Alternatively, since this script is the "new" code, you would use
`os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
determined that mbcs is the encoding for the later transfer.
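A minimal sketch of what I mean, assuming the Python 2 process on the other
end expects ANSI ('mbcs') bytes (note the 'mbcs' codec only exists on
Windows):

import os
import sys

# Option 1: start from str and encode with the explicitly chosen
# transfer encoding ('mbcs' is the Windows ANSI code page codec).
filenameb = os.listdir('.')[0].encode('mbcs')

# Option 2: start from bytes and reencode only when the filesystem
# encoding differs from the chosen transfer encoding.
raw = os.listdir(b'.')[0]
if sys.getfilesystemencoding() != 'mbcs':
    raw = raw.decode(sys.getfilesystemencoding()).encode('mbcs')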
My example is not new code. It is a very simplified script to explain
the issue that can occur in a large code base which *currently* works
well on Python 2 and Python 3 in the common case (when all data is
encodable to the ANSI code page).
If you are planning to run it with Python 3.6, then I'd argue it's "new"
code. When you don't want anything to change, you certainly don't change
the major version of your runtime.
Essentially, the problem is that this code is relying on a certain
non-guaranteed behaviour of a deprecated API, where using
sys.getfilesystemencoding() as documented would have prevented any issue
(see
https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).
sys.getfilesystemencoding() is used in applications which store data
as Unicode, but we are talking about applications storing data as
bytes, no?
No, we're talking about how Python code communicates with the file
system. Applications can store their data however they like, but when
they pass it to a filesystem function they need to pass it as str, or as
bytes encoded with sys.getfilesystemencoding() (this has always been
the case).
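For reference, os.fsencode() and os.fsdecode() do that encoding step for
you, using sys.getfilesystemencoding() and its error handler, so code that
follows the documented rule never has to hard-code a codec. A short sketch
of the round-trip:

import os

name = os.listdir('.')[0]        # str, as the file system returned it
name_bytes = os.fsencode(name)   # bytes in sys.getfilesystemencoding()
assert os.fsdecode(name_bytes) == name   # round-trips losslessly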
So yes, breaking existing code is something I would never do lightly.
However, I'm very much of the opinion that the only code that will break is
code that is already broken (or at least fragile) and that nobody is forced
to take a major upgrade to Python or should necessarily expect 100%
compatibility between major versions.
Well, it's somewhat the same issue we had in Python 2: applications
work in most cases, but start to fail with non-ASCII characters, or
fail only in some specific cases.
In this case, the ANSI code page is fine if all data can be encoded to
the ANSI code page. You start to get into trouble when you use
characters not encodable to your ANSI code page. Last time I checked,
Microsoft Visual Studio behaved badly (had bugs) with such filenames.
It's the same for many applications, so it's not like Windows
applications already handle this case very well. So let me call it a
corner case.
The existence of bugs in other applications is not a good reason to help
people create new bugs.
I'm not sure that it's worth it to explicitly break Python's backward
compatibility on Windows for such a corner case, especially because
it's already possible to fix applications by starting to use Unicode
everywhere (which would likely fix more issues than expected as a side
effect).
It's still unclear to me whether it's simpler to modify an application
using bytes to start using Unicode (for filenames), or whether your
proposition requires fewer changes.
My proposition requires fewer changes *when you target multiple platforms
and would prefer to use bytes*. It allows the code below to be written
as either branch without losing the ability to round-trip whatever
filename happens to be returned:
if os.name == 'nt':
    f = open(os.listdir('.')[-1])
else:
    f = open(os.listdir(b'.')[-1])
If you choose just the first branch (use str for paths), then you do get
a better result. However, we have been telling people to do that since
3.0 (and made it easier in 3.2 IIRC) and it's now 3.5 and they are still
complaining about not getting to use bytes for paths. So rather than
have people say "Windows support is too hard", this change enables the
second branch to be used on all platforms.
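To illustrate, a sketch of what the second branch looks like on its own
once bytes paths work everywhere (this assumes the proposed behaviour,
where Windows encodes bytes paths with UTF-8):

import os

# No platform check needed: the bytes returned by os.listdir(b'.') can be
# passed straight back to open() and round-trip on Windows and POSIX alike.
f = open(os.listdir(b'.')[-1])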
My main concern is the "makefile issue", which requires more complex
code to transcode data between UTF-8 and the ANSI code page. To me, it's
like we are going back to Python 2, where no data had a known encoding
and mojibake was the default. If you manipulate strings in two
encodings, it's easy to make mistakes and concatenate two strings
encoded with two different encodings (=> mojibake).
Your makefile example is going back to Python 2, as it has no known
encoding. If you want to associate an encoding with bytes, you decode them
to text or you explicitly specify what the encoding should be. Your own
example makes assumptions about what encoding the bytes have, which is
why it has a bug.
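A sketch of the explicit approach for the makefile case (the 'Makefile'
name and the rule are illustrative only, and the 'mbcs' codec is
Windows-only): decode the file name via the filesystem encoding, then pick
the output encoding explicitly.

import os

# Decode the file name to text via the documented filesystem encoding,
# then choose the output encoding explicitly instead of concatenating
# bytes that came from two different encodings.
target = os.fsdecode(os.listdir(b'.')[0])

with open('Makefile', 'w', encoding='mbcs') as mk:
    mk.write('all: %s\n' % target)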
Cheers,
Steve