On 16Aug2016 1650, Victor Stinner wrote:
2016-08-17 1:27 GMT+02:00 Steve Dower <steve.do...@python.org>:
    import os
    import subprocess

    filenameb = os.listdir(b'.')[0]
    # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
    # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
    print("filename bytes: %a" % filenameb)

    # 'script' (the path to a Python 2 helper) is defined elsewhere
    proc = subprocess.Popen(['py', '-2', script],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    stdout = proc.communicate(filenameb)[0]
    print("File content: %a" % stdout)


If you are defining the encoding as 'mbcs', then you need to check that
sys.getfilesystemencoding() == 'mbcs', and if it doesn't match, re-encode.

Sorry, I don't understand. What do you mean by "defining an encoding"?
It's not possible to modify sys.getfilesystemencoding() in Python.
What does "re-encode" mean? I'm lost.

You are transferring text between two applications without specifying what the encoding is. sys.getfilesystemencoding() does not apply to proc.communicate() - you can use your choice of encoding for communicating between two processes.
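For example, here is a minimal sketch of that idea (the choice of UTF-8 and the echoing child process are illustrative assumptions, not part of the original code): both sides agree on an encoding out-of-band, and sys.getfilesystemencoding() plays no role in what travels over the pipe.

```python
import subprocess
import sys

# Hypothetical round-trip between two processes that have agreed on
# UTF-8.  The pipe carries whatever bytes you choose to send; the
# filesystem encoding is irrelevant to proc.communicate().
filename = "h\xe9llo.txt"
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
proc = subprocess.Popen(child, stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
out = proc.communicate(filename.encode("utf-8"))[0]
assert out.decode("utf-8") == filename
```

Any encoding works here, as long as both ends use the same one; that is the "specifying what the encoding is" step the quoted code skips.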

Alternatively, since this script is the "new" code, you would use
`os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
determined that mbcs is the encoding for the later transfer.

My example is not new code. It is a very simplified script to explain
an issue that can occur in a large code base which *currently* works
well on Python 2 and Python 3 in the common case (it only handles data
encodable to the ANSI code page).

If you are planning to run it with Python 3.6, then I'd argue it's "new" code. When you don't want anything to change, you certainly don't change the major version of your runtime.

Essentially, the problem is that this code is relying on a certain
non-guaranteed behaviour of a deprecated API, where using
sys.getfilesystemencoding() as documented would have prevented any issue
(see
https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).

sys.getfilesystemencoding() is used in applications which store data
as Unicode, but we are talking about applications storing data as
bytes, no?

No, we're talking about how Python code communicates with the file system. Applications can store their data however they like, but when they pass it to a filesystem function they need to pass it as str, or as bytes encoded with sys.getfilesystemencoding() (this has always been the case).
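Concretely, os.fsencode() and os.fsdecode() are the helpers that apply sys.getfilesystemencoding() (and its error handler) for you, so this discipline is a one-liner at each boundary:

```python
import os
import sys

# str -> bytes and back, using the filesystem encoding rather than a
# hard-coded 'mbcs' or 'utf-8'.  The round-trip is exact.
name = "h\xe9llo.txt"
encoded = os.fsencode(name)
assert os.fsdecode(encoded) == name
print(sys.getfilesystemencoding(), encoded)
```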

So yes, breaking existing code is something I would never do lightly.
However, I'm very much of the opinion that the only code that will break is
code that is already broken (or at least fragile) and that nobody is forced
to take a major upgrade to Python or should necessarily expect 100%
compatibility between major versions.

Well, it's somewhat the same issue that we had in Python 2:
applications work in most cases, but start to fail with non-ASCII
characters, or maybe only in some cases.

In this case, the ANSI code page is fine if all data can be encoded to
the ANSI code page. You start to get into trouble when you use
characters not encodable to your ANSI code page. Last time I checked,
Microsoft Visual Studio behaved badly (had bugs) with such filenames.
It's the same for many applications. So it's not as if Windows
applications already handle this case very well. So let me call it a
corner case.

The existence of bugs in other applications is not a good reason to help people create new bugs.

I'm not sure that it's worth it to explicitly break Python's
backward compatibility on Windows for such a corner case, especially
because it's already possible to fix applications by starting to use
Unicode everywhere (which would likely fix more issues than expected
as a side effect).

It's still unclear to me whether it's simpler to modify an application
using bytes to start using Unicode (for filenames), or whether your
proposition requires fewer changes.

My proposition requires fewer changes *when you target multiple platforms and would prefer to use bytes*. It allows the below code to be written as either branch without losing the ability to round-trip whatever filename happens to be returned:

if os.name == 'nt':
    f = open(os.listdir('.')[-1])
else:
    f = open(os.listdir(b'.')[-1])

If you choose just the first branch (use str for paths), then you do get a better result. However, we have been telling people to do that since 3.0 (and made it easier in 3.2 IIRC) and it's now 3.5 and they are still complaining about not getting to use bytes for paths. So rather than have people say "Windows support is too hard", this change enables the second branch to be used on all platforms.
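A minimal sketch of what the second branch looks like once bytes paths behave consistently on all platforms (the assertions are illustrative, not from the original message): bytes go in, bytes come out, and os.fsdecode() recovers the str form whenever one is needed.

```python
import os

# With bytes paths working everywhere, the os.name check above
# collapses to a single branch.
entries = os.listdir(b'.')
assert all(isinstance(e, bytes) for e in entries)

# Round-trip back to str when a textual name is needed.
names = [os.fsdecode(e) for e in entries]
assert all(isinstance(n, str) for n in names)
```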

My main concern is the "makefile issue", which requires more complex
code to transcode data between UTF-8 and the ANSI code page. To me, it's
like we are going back to Python 2, where no data had a known encoding
and mojibake was the default. If you manipulate strings in two
encodings, you're likely to make mistakes and concatenate two strings
encoded in two different encodings (=> mojibake).
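The failure mode being described can be shown in a few lines (the cp1252/UTF-8 pairing here is an illustrative assumption; any two incompatible encodings behave the same way): once bytes from two encodings are concatenated, no single decode can recover the text.

```python
# Concatenating bytes produced with two different encodings:
a = "caf\xe9".encode("utf-8")    # b'caf\xc3\xa9'
b = "caf\xe9".encode("cp1252")   # b'caf\xe9'
mixed = a + b

# Decoding with either encoding mangles half the data => mojibake.
print(mixed.decode("utf-8", "replace"))
```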

Your makefile example is going back to Python 2, as it has no known encoding. If you want to associate an encoding with bytes, you decode it to text or you explicitly specify what the encoding should be. Your own example makes assumptions about what encoding the bytes have, which is why it has a bug.
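A sketch of the "decode at the boundary" discipline this paragraph describes (variable names are illustrative): bytes are given a stated encoding exactly once on the way in, everything in between is str, and re-encoding happens only on the way out, so two differently-encoded byte strings can never meet.

```python
# Bytes arriving from outside, with an explicitly known encoding.
raw = "caf\xe9".encode("utf-8")

# Decode once, at the boundary; all further work is on str.
text = raw.decode("utf-8")

# Re-encode only when handing data back to a bytes-only consumer.
assert (text + " au lait").encode("utf-8") == raw + b" au lait"
```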

Cheers,
Steve

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
