2016-08-17 1:27 GMT+02:00 Steve Dower <steve.do...@python.org>:
>>     filenameb = os.listdir(b'.')[0]
>>     # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
>>     # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
>>     print("filename bytes: %a" % filenameb)
>>
>>     proc = subprocess.Popen(['py', '-2', script],
>> stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>>     stdout = proc.communicate(filenameb)[0]
>>     print("File content: %a" % stdout)
>
>
> If you are defining the encoding as 'mbcs', then you need to check that
> sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.

Sorry, I don't understand. What do you mean by "defining an encoding"?
It's not possible to modify sys.getfilesystemencoding() in Python.
What does "reencode" mean here? I'm lost.


> Alternatively, since this script is the "new" code, you would use
> `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
> determined that mbcs is the encoding for the later transfer.

My example is not new code. It is a very simplified script to explain
an issue that can occur in a large code base which *currently* works
well on Python 2 and Python 3 in the common case (it only handles data
encodable to the ANSI code page).


> Essentially, the problem is that this code is relying on a certain
> non-guaranteed behaviour of a deprecated API, where using
> sys.getfilesystemencoding() as documented would have prevented any issue
> (see
> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).

sys.getfilesystemencoding() is used in applications which store data
as Unicode, but we are talking about applications storing data as
bytes, no?
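
For reference, for applications that do store paths as str, the
documented way to convert at byte boundaries is os.fsencode() /
os.fsdecode(), which consistently apply sys.getfilesystemencoding()
instead of hard-coding 'mbcs' or 'utf-8'. A minimal sketch (the
filename is just an illustrative literal):

```python
import os

# Keep the filename as str inside the application; convert at the
# boundary with the documented helpers, which always use
# sys.getfilesystemencoding() (with its error handler).
name = "héllo.txt"
encoded = os.fsencode(name)    # str -> bytes, filesystem encoding
decoded = os.fsdecode(encoded) # bytes -> str, exact inverse
assert decoded == name
```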


> So yes, breaking existing code is something I would never do lightly.
> However, I'm very much of the opinion that the only code that will break is
> code that is already broken (or at least fragile) and that nobody is forced
> to take a major upgrade to Python or should necessarily expect 100%
> compatibility between major versions.

Well, it's somewhat the same issue that we had in Python 2:
applications work in most cases, but start to fail with non-ASCII
characters, or maybe only in some cases.

In this case, the ANSI code page is fine as long as all data can be
encoded to the ANSI code page. You start to get into trouble when you
use characters not encodable to your ANSI code page. Last time I
checked, Microsoft Visual Studio behaved badly (had bugs) with such
filenames. It's the same for many applications, so it's not as if
Windows applications already handle this case well. Let me call it a
corner case.

I'm not sure that it's worth it to explicitly break Python's backward
compatibility on Windows for such a corner case, especially because
it's already possible to fix applications by switching to Unicode
everywhere (which would likely fix more issues than expected, as a
side effect).
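
For comparison, a minimal sketch of the "Unicode everywhere" version of
the quoted script: the filename stays str inside the program, and
os.fsencode() is used only at the pipe boundary. The child process here
is a stand-in echo (the original ran "py -2 script"), and the filename
is a literal standing in for an os.listdir('.') result:

```python
import os
import subprocess
import sys

# Filename kept as str throughout; a literal standing in for what
# os.listdir('.') (called with a str argument) would return.
filename = "héllo.txt"

# Stand-in child that echoes stdin back unmodified, in binary mode
# (the original example ran "py -2 script" here).
proc = subprocess.Popen(
    [sys.executable, '-c',
     'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Encode only at the boundary, with the filesystem encoding this
# Python actually uses, instead of hard-coding the ANSI code page.
stdout = proc.communicate(os.fsencode(filename))[0]
assert os.fsdecode(stdout) == filename
```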

It's still unclear to me whether it's simpler to modify an application
using bytes so that it uses Unicode (for filenames), or whether your
proposal requires fewer changes.

My main concern is the "makefile issue", which requires more complex
code to transcode data between UTF-8 and the ANSI code page. To me,
it's like going back to Python 2, where no data had a known encoding
and mojibake was the default. If you manipulate strings in two
encodings, it's easy to make mistakes and concatenate two strings
encoded with two different encodings (=> mojibake).
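
A small illustration of that failure mode, using 'cp1252' as a
stand-in for the ANSI code page (the real code page varies per
system; the bug pattern is the point):

```python
# Two byte strings produced with two different codecs.
utf8_bytes = "héllo".encode('utf-8')
ansi_bytes = "wörld".encode('cp1252')  # stand-in for the ANSI code page

# Concatenating them produces mojibake: no single codec decodes the
# result back to the intended text.
mixed = utf8_bytes + ansi_bytes
assert mixed.decode('utf-8', 'replace') != "héllowörld"
assert mixed.decode('cp1252') != "héllowörld"
```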

Victor
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
