On 22 May 2017 at 21:28, Thomas Kluyver <tho...@kluyver.me.uk> wrote:
> On Mon, May 22, 2017, at 12:02 PM, Paul Moore wrote:
>> The only reservation I have is that the choice of UTF-8 means that on
>> Windows, build backends pretty much have to explicitly manage tool
>> output (as they are pretty much certain *not* to output in UTF-8).
>> Build backend writers that aren't aware of this issue (most likely
>> because their main platform is not Windows) could very easily choose
>> to just pass through the raw bytes, and as a result *all* non-ASCII
>> output would be garbled on non-UTF-8 systems.
>>
>> Would locale.getpreferredencoding() not be a better choice here? I
>> know it has issues in some situations on Unix, but are they worse than
>> the issues UTF-8 would cause on Windows? After all it's the encoding
>> used by subprocess.Popen in "universal newlines" mode...
>
> What if it wants to send a character which can't be encoded in the
> locale encoding? It's quite easy on Windows to end up with a character
> that you can't encode as cp1252. If the build tool uses .encode(loc_enc,
> 'replace'), then you've lost information even before it gets to the
> install tool.

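To make the two failure modes quoted above concrete, here is a minimal
sketch. It assumes cp1252 as a stand-in for the Windows locale encoding,
and the sample strings and helper names are purely illustrative (they
aren't taken from any real build backend or install tool):

```python
# Illustrative only: cp1252 stands in for a Windows locale encoding, and
# the helper names below are hypothetical, not part of any real tool.
LOCALE_ENCODING = "cp1252"


def read_as_utf8(raw: bytes) -> str:
    """What a frontend sees if it assumes the stream is UTF-8.

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return raw.decode("utf-8", errors="replace")


def encode_for_locale(text: str) -> bytes:
    """What a backend emits if it encodes with the 'replace' handler."""
    return text.encode(LOCALE_ENCODING, errors="replace")


# Failure mode 1 (Paul's concern): a tool writes cp1252 bytes, the backend
# passes them through untouched, and the frontend decodes them as UTF-8.
tool_output = "caf\u00e9".encode(LOCALE_ENCODING)  # "café" as cp1252 bytes
garbled = read_as_utf8(tool_output)

# Failure mode 2 (Thomas's concern): the text contains characters cp1252
# cannot represent, so encoding with 'replace' discards them up front.
lossy = encode_for_locale("\u65e5\u672c\u8a9e.txt")  # CJK file name

print(repr(garbled))  # the accented letter is no longer recoverable
print(lossy)          # the three CJK characters have become '?'
```

Neither choice of encoding avoids both problems: which one bites depends on
what the tool emits and what the receiving side assumes.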
The counterargument is that there's plenty of text that *can* be
correctly encoded in cp1252 (especially in Europe and LATAM) that will
be rendered incorrectly if the installation tool attempts to interpret
it as UTF-8. CPython itself will also display explicitly UTF-8 encoded
text incorrectly on a Windows console in versions prior to 3.6.

> It's 2017, I really don't want to go down the 'locale specified
> encoding' route again. UTF-8 everywhere!

"UTF-8 everywhere" is fine for network services that only need to talk
to other network services, command line applications, and web browsers,
but even in 2017 it's still a problematic assumption on client devices
running Windows or Linux.

Rather than the locale specified encoding being broken in general, the
key recurring problem we've found with it on *nix systems relates to
the fact that glibc still defaults to ASCII in the C locale - "assume
ASCII really means UTF-8" is enough to solve that problem *without*
breaking compatibility with cp1252 and non-UTF-8 universal encodings.

The other recurring problem is cp1252 itself on Windows, which suffers
from the fact that there isn't a nice environment variable based way to
change the active code page when invoking a subprocess, and also that
cp65001 (the UTF-8 code page) isn't really properly supported in Python
2.7 (although you can inject a custom search function to alias it to
utf-8 [1]).

Even in that case though, mandating "thou shalt treat the streams as
UTF-8" in the spec doesn't *solve* those problems - it just means we're
specifying a behaviour that we know will provide a poor developer
experience on Windows, rather than alerting tool developers to the fact
that this is something they're going to need to be aware of.

Cheers,
Nick.

[1] http://neurocline.github.io/dev/2016/10/13/python-utf8-windows.html

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig