On 22 May 2017 at 21:28, Thomas Kluyver <tho...@kluyver.me.uk> wrote:
> On Mon, May 22, 2017, at 12:02 PM, Paul Moore wrote:
>> The only reservation I have is that the choice of UTF-8 means that on
>> Windows, build backends pretty much have to explicitly manage tool
>> output (as they are pretty much certain *not* to output in UTF-8).
>> Build backend writers that aren't aware of this issue (most likely
>> because their main platform is not Windows) could very easily choose
>> to just pass through the raw bytes, and as a result *all* non-ASCII
>> output would be garbled on non-UTF-8 systems.
>>
>> Would locale.getpreferredencoding() not be a better choice here? I
>> know it has issues in some situations on Unix, but are they worse than
>> the issues UTF-8 would cause on Windows? After all it's the encoding
>> used by subprocess.Popen in "universal newlines" mode...
>
> What if it wants to send a character which can't be encoded in the
> locale encoding? It's quite easy on Windows to end up with a character
> that you can't encode as cp1252. If the build tool uses .encode(loc_enc,
> 'replace'), then you've lost information even before it gets to the
> install tool.

The counterargument is that there's plenty of text that *can* be
correctly encoded in cp1252 (especially in Europe and LATAM) that will
be rendered incorrectly if the installation tool attempts to interpret
it as UTF-8. CPython itself will also display explicitly UTF-8 encoded
text incorrectly on a Windows console in versions prior to 3.6.
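
A minimal sketch of that failure mode, using only the standard Python 3
codecs rather than any particular build backend or installer. The sample
string is purely illustrative:

```python
# Text that cp1252 can represent without any loss.
text = "café"
raw = text.encode("cp1252")  # the bytes a cp1252 tool would emit

# Decoding with the encoding the tool really used round-trips cleanly.
assert raw.decode("cp1252") == text

# A consumer that assumes UTF-8 either fails outright...
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...or, with a lenient error handler, silently substitutes U+FFFD.
lenient = raw.decode("utf-8", errors="replace")
```

Either way the reader of the output no longer sees the "é" that the tool
wrote, even though the tool itself did nothing wrong for its platform.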

> It's 2017, I really don't want to go down the 'locale specified
> encoding' route again. UTF-8 everywhere!

"UTF-8 everywhere" is fine for network services that only need to talk
to other network services, command line applications, and web
browsers, but even in 2017 it's still a problematic assumption on
client devices running Windows or Linux.

Rather than the locale specified encoding being broken in general, the
key recurring problem we've found with it on *nix systems relates to
the fact that glibc still defaults to ASCII in the C locale - "assume
ASCII really means UTF-8" is enough to solve that problem *without*
breaking compatibility with cp1252 and non-UTF-8 universal encodings.
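
As a rough sketch of what "assume ASCII really means UTF-8" could look
like in a tool (the helper name effective_encoding is hypothetical, and
this is only one possible way to express the rule):

```python
import codecs
import locale


def effective_encoding(reported=None):
    """Return the encoding to use, treating a reported ASCII as UTF-8.

    `reported` defaults to whatever the locale machinery claims.
    """
    if reported is None:
        reported = locale.getpreferredencoding(False)
    try:
        # Normalise aliases such as "US-ASCII" to the canonical codec name.
        canonical = codecs.lookup(reported).name
    except LookupError:
        # Unknown encoding name: leave the decision to the caller.
        return reported
    # Only ASCII is reinterpreted; cp1252 and friends pass through untouched.
    return "utf-8" if canonical == "ascii" else reported
```

Since ASCII is a strict subset of UTF-8, this widening cannot misread
output that really was pure ASCII, while any other locale encoding is
left alone.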

The other recurring problem is cp1252 itself on Windows, which suffers
from the fact that there isn't a nice environment variable based way
to change the active code page when invoking a subprocess, and also
that cp65001 (the UTF-8 code page) isn't really properly supported in
Python 2.7 (although you can inject a custom search function to alias
it to utf-8 [1]).
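
For illustration, the general shape of such a search function is shown
below. This is a sketch written against the standard codecs.register()
API, not a copy of the code in [1], and the function name alias_cp65001
is made up for the example. On Python 3.8 and later the standard library
already treats cp65001 as an alias of utf-8, so the registration is only
needed on older interpreters:

```python
import codecs


def alias_cp65001(name):
    """Codec search function: resolve 'cp65001' to the UTF-8 codec."""
    if name.lower() == "cp65001":
        return codecs.lookup("utf-8")
    return None  # let other search functions handle every other name


# Search functions are consulted in registration order when a name
# is not already in the codec cache.
codecs.register(alias_cp65001)

encoded = "é".encode("cp65001")
```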

Even in that case though, mandating "thou shalt treat the streams as
UTF-8" in the spec doesn't *solve* those problems - it just means
we're specifying a behaviour that we know will provide a poor
developer experience on Windows, rather than alerting tool developers
to the fact that this is something they're going to need to be aware
of.
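
To make the "need to be aware of" part concrete, here is a hedged sketch
of the kind of transcoding a backend would have to do if the streams
were specified as UTF-8. The helper name transcode_to_utf8 is
hypothetical, and the default assumes the wrapped tool writes in the
locale's preferred encoding, which is not guaranteed (console tools on
Windows may use a different code page):

```python
import locale


def transcode_to_utf8(raw, source_encoding=None):
    """Re-encode captured tool output from its source encoding to UTF-8."""
    if source_encoding is None:
        # Assumption: the tool used the locale's preferred encoding.
        source_encoding = locale.getpreferredencoding(False)
    # Undecodable bytes become U+FFFD instead of aborting the build.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")
```

A backend that just passes the raw bytes through skips this step, which
is exactly the case where non-ASCII output ends up garbled.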

Cheers,
Nick.

[1] http://neurocline.github.io/dev/2016/10/13/python-utf8-windows.html

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
