On 20 May 2017 at 09:03, Thomas Kluyver <tho...@kluyver.me.uk> wrote:
> On Sat, May 20, 2017, at 07:54 AM, Nick Coghlan wrote:
>> * on platforms with 8-bit standard streams (e.g. Linux, Mac OS X),
>> build systems SHOULD emit UTF-8 encoded output
>> * on platforms with 16-bit standard streams (e.g. Windows), build
>> systems SHOULD emit UTF-16-LE encoded output
>
> I'm quite prepared to accept that I'm mistaken, but my understanding is
> that *standard streams* are 8-bit on Windows as well. The 16-bit thing
> that Python 3.6 does, as I understand it, is to bypass standard streams
> when it detects that they're connected to a console, and use a Windows
> API call to write text to the console directly as UTF-16.
>
> If so, when stdout/stderr are pipes, which I assume is how pip captures
> the output from build processes, there's no particular reason to send
> UTF-16 data just because it's Windows.

That's my understanding too. The standard streams are still byte
streams with an encoding. It's just that the underlying IO, when the
final destination is the console, is done by the Windows Unicode APIs.
Because of this, when the output is the console the stream can accept
any Unicode character, and so an encoding of UTF-8 is specified (and
yes, AIUI there is a translation Unicode string -> UTF-8 bytes ->
Unicode console API).

For non-console output, though, the standard streams are still byte
streams and the platform behaviour is respected, so we use the ANSI
codepage (calling this the platform standard glosses over the fact
that there are two standard codepages, ANSI and OEM, and tools don't
always make the same choice when faced with piped output).

Long story short, UTF-16 is irrelevant here.

The docs for 3.6 say "Under Windows, if the stream is interactive
(that is, if its isatty() method returns True), the console codepage
is used, otherwise the ANSI code page". This is out of date (it was
true for 3.5 and earlier). In 3.6+ utf-8 is used for interactive
streams rather than the console codepage:

>py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)"
utf-8
>py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)" >$null
cp1252

The bigger question, though, is to what extent we want to mandate that
build tools that run external tools such as compilers take
responsibility for the encoding of the output of those tools (rather
than simply passing the output through to the output stream
unmodified). And if we do want to, whether we want to allow an
exception for setuptools/distutils.

Also, a question regarding Unix - do we really want to mandate UTF-8
even if the system locale is set to something else? Won't that mean
that build tools have the same problem on Unix that we already have on
Windows, namely getting compilers to generate output in the encoding
the tool wants?

My feeling is:

1. Build systems SHOULD emit output encoded in the preferred locale
   encoding (normally UTF-8 on Unix, ANSI on Windows).
2. Build systems should ideally check the encoding used by external
   tools that they run and transcode to the correct encoding if
   necessary - but this is a quality of implementation matter.
3. Install tools MUST NOT fail if build tools produce output with the
   wrong encoding, but MUST correctly reproduce build tool output if
   the build tools do produce the right encoding.

My biggest concern with this is that I believe that Visual C produces
output in the OEM codepage even when output to a pipe. Actually I just
did some experiments (VS 2015), and it's even worse than that. The
compiler (cl) seems to use the OEM code page when writing to a pipe,
but the linker uses the ANSI code page. This means that a command like
"cl a£bc" produces output on (a piped) stdout that contains mixed
encodings.

Given this situation, I think we have to give up and take the view
that the Visual C tools are simply broken in this regard, and we
shouldn't worry about them. So I'm inclined to drop point (2) from the
3 above.

Paul
_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
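
To illustrate point 3 and the mixed-encoding problem described above,
here is a minimal Python sketch. It is not pip's actual implementation:
the helper name decode_build_output is hypothetical, and cp850 and
cp1252 merely stand in for the OEM and ANSI code pages of a typical
Western European Windows setup. The idea is to decode captured bytes
with the preferred locale encoding, while keeping undecodable bytes
visible as escapes rather than failing.

```python
import locale


def decode_build_output(raw, encoding=None):
    """Decode captured build output without ever raising.

    Hypothetical helper: bytes that are invalid in the expected
    encoding are shown as backslash escapes instead of causing an error.
    """
    if encoding is None:
        # Preferred locale encoding: normally UTF-8 on Unix, ANSI on Windows.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="backslashreplace")


# Simulate the mixed output: the same text written once in an OEM code
# page (assumed cp850, as the compiler would) and once in an ANSI code
# page (assumed cp1252, as the linker would).
compiler_bytes = "a£bc".encode("cp850")   # b'a\x9cbc'
linker_bytes = "a£bc".encode("cp1252")    # b'a\xa3bc'
mixed = compiler_bytes + b"\n" + linker_bytes

# Decoding as ANSI silently garbles the compiler's line; decoding as
# UTF-8 escapes both non-ASCII bytes. Neither call raises.
as_ansi = decode_build_output(mixed, "cp1252")
as_utf8 = decode_build_output(mixed, "utf-8")
print(as_ansi)
print(as_utf8)
```

No single encoding decodes both lines correctly, which is why
transcoding in the build tool (point 2) cannot be done reliably for
this toolchain, whereas the install tool can still avoid failing.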