On 23 May 2017 at 22:41, Thomas Kluyver <tho...@kluyver.me.uk> wrote:
> On Tue, May 23, 2017, at 12:56 PM, Paul Moore wrote:
> Can I take a quick poll of what people following this topic think?
>
> Q1: Default encoding for captured build stdout/stderr
> a. UTF-8 (consistent, can represent any character)
> b. Locale default (convenient if backend runs subprocesses which produce
> output in the locale encoding)
>
> Q2: Handling unknown encodings from subprocesses
> a. Backend should ensure all output is valid in the target encoding
> (Q1), though it may not be accurate.
> b. Unknown output may be passed on as bytes without transcoding, so the
> frontend can e.g. dump it to a file.

Up to this point, I've been in favour of both 1b and 2b, since they're the main options that allow a build backend to get itself out of the way entirely and let the frontend deal with the problem, rather than having to figure out encoding issues itself. pip already has to deal with the "arbitrarily encoded data" problem for the current setup.py invocation, and whatever solution is adopted there should suffice for PEP 517 as well.

If PEP 426 taught me anything, it was that if you aren't planning to write something yourself, and don't have the budget to pay someone else to write it for you, your best bet is to adhere as closely to the status quo as you can while still incorporating the 100% essential changes that you actually need. (A Zen of Python style aphorism for that: "The right way and the easy way should be the same way")

To be honest, I still think that's likely to be the right way to go for PEP 517, and will take some convincing that we're going to be able to persuade future backend developers who personally couldn't care less about encoding issues to adopt anything more complex.

However, I also realised that there's a potential third way to handle this problem: design a Python level API that allows frontends to use more structured data formats (e.g. JSON) for communication between the frontend and its backend shim.

In particular, I'm thinking we could move the current "config_settings" dict onto something like a "build context" object that, *even in Python 2*, offers a Unicode "outstream" and "errstream", which the backend is then expected to use rather than writing to sys.stdout/sys.stderr directly.
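
As a rough illustration of the "build context" idea, here is a minimal sketch. Everything in it is hypothetical: the names BuildContext and serialize_streams, the (wheel_directory, context) hook signature, and the JSON layout are assumptions made up for this example, and none of them are part of PEP 517.

```python
import io
import json


class BuildContext(object):
    """Hypothetical frontend-provided context; not part of PEP 517."""

    def __init__(self, config_settings=None, outstream=None, errstream=None):
        # The existing config_settings dict simply rides along on the context.
        self.config_settings = dict(config_settings or {})
        # io.StringIO only accepts text (unicode), on Python 2 as well as 3,
        # so the backend never hands bytes to the frontend.
        self.outstream = outstream if outstream is not None else io.StringIO()
        self.errstream = errstream if errstream is not None else io.StringIO()


def build_wheel(wheel_directory, context):
    """Hypothetical backend hook that reports via the context streams."""
    context.outstream.write(u"building wheel in %s\n" % wheel_directory)
    context.errstream.write(u"warning: caf\u00e9.txt skipped\n")
    return u"example-1.0-py3-none-any.whl"


def serialize_streams(context):
    """Frontend shim side: ship captured text back in a structured format."""
    return json.dumps({
        "stdout": context.outstream.getvalue(),
        "stderr": context.errstream.getvalue(),
    })


context = BuildContext({"--verbose": "1"})
wheel_name = build_wheel("dist", context)
payload = serialize_streams(context)  # ASCII-only JSON, safe over any pipe
```

Because json.dumps escapes non-ASCII characters by default, the payload passed between the shim and the frontend does not depend on the locale encoding at all; that choice belongs to the frontend, not to the backend.
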
That context could also provide a Python 3 style "run()" API for subprocess invocation that implements the preferred stream handling behaviour (including applying the "backslashreplace" error handler regardless of version).

That way, instead of trying to hit build backend developers with a fairly flimsy stick ("Thou shalt comply with the specification or some other open source developers may say mildly disapproving things about you on the internet"), we'd instead be offering them the easy way out of letting the frontend-provided build context deal with all the messy encoding issues.

Taking that approach of just defining a helper API and expecting build backends to either use it or emulate it gives us some quite attractive properties:

- backends themselves deal entirely in Unicode, not bytes
- frontends get full control of the communication format used between the frontend and its backend shim; they're not restricted to plain text
- the Python 2/3 differences can be handled in the frontend CLI shims, rather than every backend needing to do it
- we don't need to enshrine any particular encoding handling behaviour in the spec, we can let it be a quality of implementation issue for the frontend tools
- platform specific tools can make platform specific choices
- tools can adapt to new platforms without requiring a specification update
- tools can update their default behaviour as other considerations change (e.g. the possible introduction of locale coercion and PYTHONUTF8 mode in 3.7)

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
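
For reference, the "run()" helper described above could look roughly like the sketch below. The function name, its parameters, and the UTF-8 default are assumptions for illustration only. It also relies on "backslashreplace" being usable when decoding, which is only true on Python 3.5 and later, so a frontend supporting older versions would need its own fallback.

```python
import io
import subprocess
import sys


def run(args, outstream, encoding="utf-8"):
    """Hypothetical helper: run a command, forward its output as text.

    Bytes that are invalid in `encoding` are kept visible as backslash
    escapes (e.g. \\xff) instead of raising UnicodeDecodeError.
    """
    proc = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    raw, _ = proc.communicate()
    # Decoding with "backslashreplace" requires Python 3.5+.
    outstream.write(raw.decode(encoding, "backslashreplace"))
    return proc.returncode


# Usage: the backend passes the frontend-provided text stream to the helper.
captured = io.StringIO()
status = run([sys.executable, "-c", "print('hello')"], captured)
```

The backend only ever sees text in the stream, and the choice of encoding and error handling stays in one place that the frontend controls.
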