On Tue, May 23, 2017, at 12:56 PM, Paul Moore wrote:
> So based on your proposal, won't you introduce similar bugs by using
> print() without sorting out encodings? Unless (see below) you assume
> that the frontend sorts it out for you.
If you strictly follow the locale encoding, you need to sort it out in
Python anyway, in case the stdout encoding has been overridden by
PYTHONIOENCODING, or PYTHONSTARTUP, or the infernal .pth files. I accept
that those are corner cases, though.

> Yes, subprocesses that produce a known encoding are trivial to deal
> with. But remembering that you *need* to deal with them less so. My
> concern here is the same one as you quote above - assuming that
> subprocess returns UTF-8 encoded bytes, because it does on Linux and
> Mac.

I agree, that is a concern.

> But if you genuinely don't know (or worse, know there is no consistent
> encoding) I'm not sure I see how passing unknown bytes onto the
> frontend, which by necessity has less context to guess what those
> bytes might mean, is the right answer. The frontend is better able to
> know what it wants to *do* with those bytes, but "convert them to text
> for the user to see" is the only real answer here IMO (sure, dumping
> the raw bytes to a file might be an option, but I imagine it'll be a
> relatively uncommon choice).

I was indeed thinking of dumping them to a file. It's not very
user-friendly, but it means the information is there if you need it. I
suspect that, regardless of the locale, technical information like code
and filesystem paths will often contain enough ASCII that a human can
interpret them even if non-ASCII characters are wrongly encoded. So I
hope that needing to reverse-engineer the encoding will be relatively
rare.

The appeal of this is that it follows "in the face of ambiguity, refuse
the temptation to guess". If the backend guesses the encoding
incorrectly, the frontend gets valid UTF-8 but is no better able to
display it meaningfully, and you then need to go through
decode-encode-decode to recover the original text, even if no data was
lost.

Another option: if the backend runs a subprocess with unknown output
encoding, it redirects that output to a temp file and prints the path in
its own output.
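As a rough sketch of that temp-file option (illustrative only, nothing
like this is specified anywhere; the function name and message are made
up):

```python
import subprocess
import tempfile

def run_with_unknown_encoding(cmd):
    """Hypothetical backend helper: run a subprocess whose output
    encoding is unknown, capture the raw bytes to a temp file, and
    report the path in the backend's own output."""
    with tempfile.NamedTemporaryFile(prefix="build-output-",
                                     suffix=".log", delete=False) as f:
        # stdout/stderr go straight to the file as raw bytes; no
        # decoding is attempted, so nothing is corrupted by a wrong guess.
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT)
    # The backend's own message is plain ASCII, so it is valid in any
    # plausible target encoding.
    print("Subprocess output (unknown encoding) written to:", f.name)
    return f.name
```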
Then there's a better chance that the unknown encoding is at least
consistent within the file, so tools can do encoding detection on it.

> At the end of the day, there is no perfect answer here. Someone is
> going to have to make a judgement call, and as the PEP author, I guess
> that's you. So at this point I'll stop badgering you and leave it up
> to you to decide what the consensus is. Thanks for listening to my
> points, though.

I know what I think, but I don't feel like there's a consensus as yet.
Can I take a quick poll of what people following this topic think?

Q1: Default encoding for captured build stdout/stderr
  a. UTF-8 (consistent, can represent any character)
  b. Locale default (convenient if backend runs subprocesses which
     produce output in the locale encoding)

Q2: Handling unknown encodings from subprocesses
  a. Backend should ensure all output is valid in the target encoding
     (Q1), though it may not be accurate.
  b. Unknown output may be passed on as bytes without transcoding, so
     the frontend can e.g. dump it to a file.

I'm currently 1:a, 2:?a.

Thomas

_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig