Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Thu, May 25, 2017, at 03:38 PM, Nick Coghlan wrote:
> Seeing it like this pushes me from "Eh, maybe?" to "No, definitely not"
> [on the log directory] :)

That's fine by me. It does feel like unwanted extra complexity for both backends and frontends. And backends dealing with output in an unknown encoding can still choose to write it to a file and log "full output in /tmp/blah" if they want - they don't need a spec for that.

Thomas
___
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 25May2017 0756, Paul Moore wrote:
> On 25 May 2017 at 15:38, Nick Coghlan wrote:
>> So I'm inclined to accept the encoding amendment, and then
>> provisionally accept the overall PEP pending implementation in pip.
>
> Me too. (Assuming I understand Steve's comments on backends, and he's
> comfortable with the idea that backends need to capture and manage
> MSVC output for presentation to the frontend).

Sounds like you understood my comments :)

+1 overall (-0 on a formal way to pass logs via the disk)

As I mentioned at one point, there's a bug against the CPython test suite that the distutils tests show too much console output, which is because distutils currently just lets MSVC write directly to the console. To fix it, we need to capture the output and then conditionally display it, at which point transcoding from ANSI to UTF-8 with 'replace' is trivial, and saves the front end (in this case, the test suite) from having to guess.

So it is something that the backend around MSVC needs to do regardless, and if the PEP says "send me UTF-8" then it's one less thing for the backend developer to guess.

Cheers,
Steve
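The capture-and-transcode step described above can be sketched roughly as follows. This is a minimal illustration and not distutils code: the helper names `transcode_to_utf8` and `run_and_transcode` are hypothetical, and `cp1252` merely stands in for whatever ANSI code page the tool really uses.

```python
import subprocess
import sys


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode captured tool output as UTF-8 (hypothetical helper).

    Bytes that are not valid in source_encoding are decoded to U+FFFD,
    so the result is always valid UTF-8 even if information is lost.
    """
    return raw.decode(source_encoding, errors="replace").encode("utf-8")


def run_and_transcode(cmd, source_encoding: str) -> bytes:
    """Run a tool, capture stdout and stderr together, return UTF-8 bytes."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return transcode_to_utf8(proc.stdout, source_encoding)
```

A backend would then decide whether to show the transcoded bytes, for example only when the tool exits with an error. Choosing the right `source_encoding` is still the backend's job; the sketch only shows that the conversion itself is short once the encoding is known.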
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 25 May 2017 at 15:38, Nick Coghlan wrote:
> Seeing it like this pushes me from "Eh, maybe?" to "No, definitely not" :)

Agreed. Given that it's stated as optional for frontends to support it, I'd be arguing against pip bothering (as it seems like too much complexity) - so I'd rather leave it out until another frontend comes along. If at that point there's a need, we can always revise the PEP.

> So I'm inclined to accept the encoding amendment, and then
> provisionally accept the overall PEP pending implementation in pip.

Me too. (Assuming I understand Steve's comments on backends, and he's comfortable with the idea that backends need to capture and manage MSVC output for presentation to the frontend).

Paul
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 26 May 2017 at 00:04, Thomas Kluyver wrote: > On Thu, May 25, 2017, at 02:27 PM, Paul Moore wrote: >> I'd be concerned here that we risk making the frontend UI a lot more >> complex for little actual benefit. I'd rather we stick with the >> current model, where a backend just has some output to pass through to >> the frontend. Let's get a solution that works for that before adding >> extra complexity, or we'll never get the PEP signed off. > > I'm inclined to agree that we're overcomplicating things. But if we > can't agree on which simple-but-imperfect option to take, maybe it's > worth trying to work out something more complex. > > My proposed addition to the PEP so far says this: > > The build frontend may capture stdout and/or stderr from the backend. If > the backend detects that an output stream is not a terminal/console > (e.g. ``not sys.stdout.isatty()``), it SHOULD ensure that any output it > writes to that stream is UTF-8 encoded. The build frontend MUST NOT fail > if captured output is not valid UTF-8, but it MAY not preserve all the > information in that case (e.g. it may decode using the *replace* error > handler in Python). If the output stream is a terminal, the build > backend is responsible for presenting its output accurately, as for any > program running in a terminal. > > We could add a paragraph like this: > > The backend may do some operations, such as running subprocesses, which > produce output in an unknown encoding. To handle such output, the build > frontend MAY (?) create an empty directory, and set the environment > variable PEP517_BUILD_LOGS to the path of this directory for the > backend. If this environment variable is set, the backend MAY create any > number of files inside this directory containing additional output. This > is designed to allow the use of encoding detection tools on this output. 
> If files are created in this directory, frontends SHOULD display its
> location in their output, and MAY display the contents of the files.

Seeing it like this pushes me from "Eh, maybe?" to "No, definitely not" :)

So that gets us to the point where we're agreeing that your suggested addition to the PEP is basically right, with the only remaining question being whether or not we're happy with the section that says "it SHOULD ensure that any output it writes to that stream is UTF-8 encoded".

For a Python with locale coercion enabled, we're going to get that by default, so such environments will comply without backend developers doing anything in particular. Frontends may also decide to implement their own PEP 538 style locale coercion for the backend build process when they're running in a non-UTF-8 locale - specifying UTF-8 as a SHOULD in the PEP gives them implied permission to do that.

So I don't think this is going to place any undue burden on backend developers for *nix systems - frontends will probably want to implement PEP 538 style locale coercion for LC_CTYPE to handle cases where tools rely on the default stream encoding, but I think that's fine.

That leaves Windows, and there I'm prepared to defer to Steve Dower's opinion that it's better to deal with the encoding challenges of consuming the output from MSVC in the build backend, rather than expecting the frontend to deal with it. We also have a precedent now in pip's legacy subprocess handling for what doing that reliably looks like, so it shouldn't be hard for backend implementors to re-use that approach as needed.

So I'm inclined to accept the encoding amendment, and then provisionally accept the overall PEP pending implementation in pip. I'll give others a couple of days to comment further, but assuming nothing else comes up, I'll go ahead and do that on the weekend :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
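One way to read the two halves of the encoding amendment is sketched below, for Python 3.7 or later. The function names `ensure_utf8_when_captured` and `decode_captured_output` are hypothetical and do not come from the PEP or from pip; the sketch assumes the backend writes through an `io.TextIOWrapper` such as `sys.stdout`.

```python
import io


def ensure_utf8_when_captured(stream):
    """Backend side: switch a non-terminal text stream to UTF-8.

    Terminal streams are left alone, matching the rule that the backend
    presents its output like any other program running in a terminal.
    """
    if stream.isatty():
        return stream
    # TextIOWrapper.reconfigure() is available from Python 3.7.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


def decode_captured_output(raw: bytes) -> str:
    """Frontend side: never fail on invalid UTF-8, accept some loss."""
    return raw.decode("utf-8", errors="replace")
```

A backend might call `ensure_utf8_when_captured(sys.stdout)` once at start-up. The frontend half uses the *replace* error handler that the proposed wording itself gives as an example.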
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Thu, May 25, 2017, at 02:27 PM, Paul Moore wrote: > I'd be concerned here that we risk making the frontend UI a lot more > complex for little actual benefit. I'd rather we stick with the > current model, where a backend just has some output to pass through to > the frontend. Let's get a solution that works for that before adding > extra complexity, or we'll never get the PEP signed off. I'm inclined to agree that we're overcomplicating things. But if we can't agree on which simple-but-imperfect option to take, maybe it's worth trying to work out something more complex. My proposed addition to the PEP so far says this: The build frontend may capture stdout and/or stderr from the backend. If the backend detects that an output stream is not a terminal/console (e.g. ``not sys.stdout.isatty()``), it SHOULD ensure that any output it writes to that stream is UTF-8 encoded. The build frontend MUST NOT fail if captured output is not valid UTF-8, but it MAY not preserve all the information in that case (e.g. it may decode using the *replace* error handler in Python). If the output stream is a terminal, the build backend is responsible for presenting its output accurately, as for any program running in a terminal. We could add a paragraph like this: The backend may do some operations, such as running subprocesses, which produce output in an unknown encoding. To handle such output, the build frontend MAY (?) create an empty directory, and set the environment variable PEP517_BUILD_LOGS to the path of this directory for the backend. If this environment variable is set, the backend MAY create any number of files inside this directory containing additional output. This is designed to allow the use of encoding detection tools on this output. If files are created in this directory, frontends SHOULD display its location in their output, and MAY display the contents of the files. 
That's not a massive amount more complexity for the spec, but it does add a moderate burden to frontend & backend implementations which want to properly support it.

If you're being purist about it, displaying a path on a Unix based system is producing output in an unknown encoding, since filenames in Unix are bytes. I don't imagine many tools are going to go that far, though.

Thomas
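For illustration only, the log-directory paragraph proposed above could look like this on each side. The environment variable name PEP517_BUILD_LOGS is taken from the draft wording; the function names are hypothetical, and the draft says nothing about file naming, so `name` is an assumption.

```python
import os
from pathlib import Path

ENV_VAR = "PEP517_BUILD_LOGS"  # name used in the draft paragraph


def backend_save_raw_output(name, raw, environ=os.environ):
    """Backend side: store undecoded tool output if a log directory was offered.

    Returns the path written, or None when the frontend did not set the
    variable (hypothetical helper, not part of any real backend).
    """
    log_dir = environ.get(ENV_VAR)
    if not log_dir:
        return None
    path = Path(log_dir) / name
    path.write_bytes(raw)  # bytes are kept as-is, no transcoding
    return path


def frontend_collect_logs(log_dir):
    """Frontend side: list the files a backend left in the directory."""
    return sorted(p for p in Path(log_dir).iterdir() if p.is_file())
```

A frontend would create the empty directory, export the variable for the backend process, and afterwards print the directory location if `frontend_collect_logs` returns anything.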
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 25 May 2017 at 13:26, Nick Coghlan wrote: > On 24 May 2017 at 20:29, Thomas Kluyver wrote: >> Nick: >>> That's actually pretty similar to the way tools like mock (the chroot >>> based RPM builder) work. That way, build backends could choose >>> between: >>> >>> - use pipes to stream output from the tools they call, deal with >>> encoding issues themselves >>> - redirect output to a suitable named file in the tool log directory >> >> Do you know if that system works well for mock? Shall I try to draft a >> spec of something like this for PEP 517? > > I'm genuinely unsure. The main downside of the directory based > approach is that it doesn't play well with CI systems in general - > those are typically set up to capture the standard streams, and if you > want to capture other artifacts, you either have to stream them > anyway, or else you have to use a CI specific upload mechanism to keep > them around. > > I guess what we could do is have a "debug log directory" as part of > the defined interface between the frontends and the build backends, > and then the exact UX of dealing with those build logs would then be > something for frontends to define (e.g. offering an option to > automatically stream the logs after a failed build, with appropriate > headers and footers around each file) To me, this feels like a lot of potentially unnecessary complexity. At the moment pip's UI works around "run a build, get some output, display the output if the situation warrants (i.e., there was an error)". The only stumbling block is over transferring that output from backend to frontend where we need to consider text/bytes issues. We're now talking about potentially managing a directory containing logs, do we need to persist log files, should we display the file content or just the filename, etc. I'd be concerned here that we risk making the frontend UI a lot more complex for little actual benefit. 
I'd rather we stick with the current model, where a backend just has some output to pass through to the frontend. Let's get a solution that works for that before adding extra complexity, or we'll never get the PEP signed off.

Paul
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
FWIW, I was just reading an article about writing libraries to just operate on streams and totally ignore stdout/stdin/file io, and just leave the IO to something else. It may be a good idea to define the spec as purely operating on byte and text streams, then leave where those streams go as an implementation detail. That way for CI systems they could dump to stdout/stdin and other systems could do something different. -W On Thu, May 25, 2017, 7:27 AM Nick Coghlan wrote: > On 24 May 2017 at 20:29, Thomas Kluyver wrote: > > Nick: > >> That's actually pretty similar to the way tools like mock (the chroot > >> based RPM builder) work. That way, build backends could choose > >> between: > >> > >> - use pipes to stream output from the tools they call, deal with > >> encoding issues themselves > >> - redirect output to a suitable named file in the tool log directory > > > > Do you know if that system works well for mock? Shall I try to draft a > > spec of something like this for PEP 517? > > I'm genuinely unsure. The main downside of the directory based > approach is that it doesn't play well with CI systems in general - > those are typically set up to capture the standard streams, and if you > want to capture other artifacts, you either have to stream them > anyway, or else you have to use a CI specific upload mechanism to keep > them around. > > I guess what we could do is have a "debug log directory" as part of > the defined interface between the frontends and the build backends, > and then the exact UX of dealing with those build logs would then be > something for frontends to define (e.g. offering an option to > automatically stream the logs after a failed build, with appropriate > headers and footers around each file) > > Cheers, > Nick. 
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 24 May 2017 at 20:29, Thomas Kluyver wrote:
> Nick:
>> That's actually pretty similar to the way tools like mock (the chroot
>> based RPM builder) work. That way, build backends could choose
>> between:
>>
>> - use pipes to stream output from the tools they call, deal with
>>   encoding issues themselves
>> - redirect output to a suitable named file in the tool log directory
>
> Do you know if that system works well for mock? Shall I try to draft a
> spec of something like this for PEP 517?

I'm genuinely unsure. The main downside of the directory based approach is that it doesn't play well with CI systems in general - those are typically set up to capture the standard streams, and if you want to capture other artifacts, you either have to stream them anyway, or else you have to use a CI specific upload mechanism to keep them around.

I guess what we could do is have a "debug log directory" as part of the defined interface between the frontends and the build backends, and then the exact UX of dealing with those build logs would then be something for frontends to define (e.g. offering an option to automatically stream the logs after a failed build, with appropriate headers and footers around each file)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Wed, May 24, 2017, at 01:22 AM, Chris Jerdonek wrote:
> 1) Would it make sense to provide a way for build tools to specify
> what encoding they use (e.g. if not using the default), instead of
> changing their encoding to conform to a standard? It seems like that
> could be easier, although I know this doesn't address problems like
> non-conforming tools.

Interesting idea, but I'm not convinced it actually makes anything easier. You still have the same issues if the backend runs a subprocess which doesn't produce output in the expected encoding. And there would be some small amount of added complexity to communicate the encoding to the frontend.

> 2) In terms of debugging, in cases where there are encoding-related
> errors, it would help if the overall system made it easy to pinpoint
> which parts of the system are at fault (using good error handling,
> diagnostic messages, etc).

Agreed.

Nick:
> That's actually pretty similar to the way tools like mock (the chroot
> based RPM builder) work. That way, build backends could choose
> between:
>
> - use pipes to stream output from the tools they call, deal with
>   encoding issues themselves
> - redirect output to a suitable named file in the tool log directory

Do you know if that system works well for mock? Shall I try to draft a spec of something like this for PEP 517?

Thomas
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 24 May 2017 at 03:04, Thomas Kluyver wrote:
> I'll propose a variant of an idea I described already: the frontend
> could provide the backend with a fresh temp directory. If the backend
> needs to run other processes, it can redirect the output into a file in
> that temp directory. Then you have files with an unknown encoding, but
> each file will hopefully have one encoding, and you can use a tool like
> chardet to guess what it is.

That's actually pretty similar to the way tools like mock (the chroot based RPM builder) work. That way, build backends could choose between:

- use pipes to stream output from the tools they call, deal with encoding issues themselves
- redirect output to a suitable named file in the tool log directory

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
A couple comments: 1) Would it make sense to provide a way for build tools to specify what encoding they use (e.g. if not using the default), instead of changing their encoding to conform to a standard? It seems like that could be easier, although I know this doesn't address problems like non-conforming tools. 2) In terms of debugging, in cases where there are encoding-related errors, it would help if the overall system made it easy to pinpoint which parts of the system are at fault (using good error handling, diagnostic messages, etc). --Chris On Tue, May 23, 2017 at 10:04 AM, Thomas Kluyver wrote: > On Tue, May 23, 2017, at 04:20 PM, Nick Coghlan wrote: >> Up to this point, I've been in favour of both 1b and 2b, since they're > > Noted. > >> However, I also realised that there's a potential third way to handle >> this problem: design a Python level API that allows front ends to use >> more structured data formats (e.g. JSON) for communication between the >> frontend and their backend shim. >> >> In particular, I'm thinking we could move the current >> "config_settings" dict onto something like a "build context" object >> that, *even in Python 2*, offers a Unicode "outstream" and >> "errstream", which the backend is then expected to use rather than >> writing to sys.stdout/err directly. That context could also provide a >> Python 3 style "run()" API for subprocess invocation that implemented >> the preferred stream handling behaviour for subprocess invocation >> (including applying the "backslashreplace" error handler regardless of >> version) > > I'm not really compelled by this so far: > ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Tue, May 23, 2017, at 04:20 PM, Nick Coghlan wrote: > Up to this point, I've been in favour of both 1b and 2b, since they're Noted. > However, I also realised that there's a potential third way to handle > this problem: design a Python level API that allows front ends to use > more structured data formats (e.g. JSON) for communication between the > frontend and their backend shim. > > In particular, I'm thinking we could move the current > "config_settings" dict onto something like a "build context" object > that, *even in Python 2*, offers a Unicode "outstream" and > "errstream", which the backend is then expected to use rather than > writing to sys.stdout/err directly. That context could also provide a > Python 3 style "run()" API for subprocess invocation that implemented > the preferred stream handling behaviour for subprocess invocation > (including applying the "backslashreplace" error handler regardless of > version) I'm not really compelled by this so far: - It's more complexity for build tools - instead of just producing output as usual, now they have to pass around a context object and direct output to it. - What does the frontend do if there is output on stdout/stderr anyway? Throw it away? Let it go straight to the terminal? Reprimand the backend for not using the streams in the build context? Or try to include it as part of build output anyway? - I don't see how it solves the issue with subprocesses producing unknown encodings. The output bytes still need to be interpreted somehow. I'll propose a variant of an idea I described already: the frontend could provide the backend with a fresh temp directory. If the backend needs to run other processes, it can redirect the output into a file in that temp directory. Then you have files with an unknown encoding, but each file will hopefully have one encoding, and you can use a tool like chardet to guess what it is. 
Thomas
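The temp-directory variant proposed in this message might be sketched as below, assuming the frontend has already created the directory and passed its path to the backend. The helper name `run_with_output_to_file` is hypothetical. The encoding-detection step (for example with chardet) is left out on purpose, because the sketch only covers the redirection.

```python
import subprocess
from pathlib import Path


def run_with_output_to_file(cmd, log_dir, name):
    """Run a tool with stdout and stderr redirected into one file.

    The bytes are never decoded by the backend, so each file keeps
    whatever single encoding the tool happened to use.
    """
    log_path = Path(log_dir) / name
    with open(log_path, "wb") as log_file:
        result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode, log_path
```

The backend can then print `log_path` in its own output, and a frontend or a person debugging the build can run a detection tool over the file afterwards.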
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 17:16, Nick Coghlan wrote:
> Yep, and that's also why I want to avoid trying to use it to improve
> the encoding handling situation - pip and other tools have to deal
> with the current mess regardless, and there's already likely to be
> some significant churn in this space as a result of the changes Victor
> and I have proposed for Python 3.7.

Encoding issues have been around in pip for many years, with little or no progress. We might be getting a handle on things now (Thomas' initial email in this thread was very timely - the fact that I was in the middle of working on the encoding issue in pip was the only reason I picked up on the need for clarity in the PEP) but I'd be very cautious about saying we've got it solved until we have the latest changes in a released version of pip and we get some feedback (or silence, more likely) from international users.

One of the reasons I made the point about ease of testing earlier in the thread is that we've found it's extremely difficult to pin down the root of the reported problems in pip - the route that badly encoded data takes from build tool to pip's output is pretty convoluted. Anything that adds clear-cut boundaries at which we can make guarantees about the integrity of the data will help a lot with this in the future.

Paul
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 24 May 2017 at 01:39, Paul Moore wrote:
> On 23 May 2017 at 16:20, Nick Coghlan wrote:
>> Taking that approach of just defining a helper API and expecting build
>> backends to either use it or emulate it gives us some quite attractive
>> properties:
>
> Making the output data part of a structured API (and by implication,
> saying that backends shouldn't be writing to stdout directly at all)
> would definitely improve the situation, IMO. Frankly, it seems likely
> that the only real way we're going to get backend developers to
> consider encodings is by having the "build output" as a string value
> passed back via the API, rather than implied in the fact that backends
> can write to stdout/err. It also squarely places the responsibility
> for dealing with the question of displaying full-range Unicode output
> to the user onto the frontend.
>
> However, it's a relatively big change to the PEP and there's a risk
> that by endlessly reaching for perfection, we miss the chance to get
> the PEP in at all (another lesson we should probably learn from PEP
> 426!)

Yep, and that's also why I want to avoid trying to use it to improve the encoding handling situation - pip and other tools have to deal with the current mess regardless, and there's already likely to be some significant churn in this space as a result of the changes Victor and I have proposed for Python 3.7.

As a result, I think adding in additional requirements here runs a significant risk of requiring build backend developers to do additional work to achieve nominal spec compliance without actually simplifying anything in practice for frontend developers.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 16:20, Nick Coghlan wrote:
> Taking that approach of just defining a helper API and expecting build
> backends to either use it or emulate it gives us some quite attractive
> properties:

Making the output data part of a structured API (and by implication, saying that backends shouldn't be writing to stdout directly at all) would definitely improve the situation, IMO. Frankly, it seems likely that the only real way we're going to get backend developers to consider encodings is by having the "build output" as a string value passed back via the API, rather than implied in the fact that backends can write to stdout/err. It also squarely places the responsibility for dealing with the question of displaying full-range Unicode output to the user onto the frontend.

However, it's a relatively big change to the PEP and there's a risk that by endlessly reaching for perfection, we miss the chance to get the PEP in at all (another lesson we should probably learn from PEP 426!)

Paul
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 22:41, Thomas Kluyver wrote:
> On Tue, May 23, 2017, at 12:56 PM, Paul Moore wrote:
> Can I take a quick poll of what people following this topic think?
>
> Q1: Default encoding for captured build stdout/stderr
> a. UTF-8 (consistent, can represent any character)
> b. Locale default (convenient if backend runs subprocesses which produce
> output in the locale encoding)
>
> Q2: Handling unknown encodings from subprocesses
> a. Backend should ensure all output is valid in the target encoding
> (Q1), though it may not be accurate.
> b. Unknown output may be passed on as bytes without transcoding, so the
> frontend can e.g. dump it to a file.

Up to this point, I've been in favour of both 1b and 2b, since they're the main options that allow a build backend to get itself out of the way entirely and let the front-end deal with the problem rather than having to figure out encoding issues for themselves. pip already has to deal with the "arbitrarily encoded data" problem for the current setup.py invocation, and whatever solution is adopted there should suffice for PEP 517 as well.

If PEP 426 taught me anything, it was that if you weren't planning to write something yourself, and didn't have the budget to pay someone else to write it for you, your best bet is to adhere as closely to the status quo as you can while still incorporating the 100% essential changes that you actually need. (A Zen of Python style aphorism for that: "The right way and the easy way should be the same way")

To be honest, I still think that's likely to be the right way to go for PEP 517, and will take some convincing that we're going to be able to persuade future backend developers who personally couldn't care less about encoding issues to adopt anything more complex.

However, I also realised that there's a potential third way to handle this problem: design a Python level API that allows front ends to use more structured data formats (e.g.
JSON) for communication between the frontend and their backend shim. In particular, I'm thinking we could move the current "config_settings" dict onto something like a "build context" object that, *even in Python 2*, offers a Unicode "outstream" and "errstream", which the backend is then expected to use rather than writing to sys.stdout/err directly. That context could also provide a Python 3 style "run()" API for subprocess invocation that implemented the preferred stream handling behaviour for subprocess invocation (including applying the "backslashreplace" error handler regardless of version) That way, instead of trying to hit build backend developers with a fairly flimsy stick ("Thou shalt comply with the specification or some other open source developers may say mildly disapproving things about you on the internet"), we'd instead be offering them the easy way out of letting the front-end provided build context deal with all the messy encoding issues. Taking that approach of just defining a helper API and expecting build backends to either use it or emulate it gives us some quite attractive properties: - backends themselves deal entirely in Unicode, not bytes - frontends get full control of the communication format used between the frontend and its backend shim - they're not restricted to plain text - the Python 2/3 differences can be handled in the frontend CLI shims, rather than every backend needing to do it - we don't need to enshrine any particular encoding handling behaviour in the spec, we can let it be a quality of implementation issue for the front-end tools - platform specific tools can make platform specific choices - tools can adapt to new platforms without requiring a specification update - tools can update their default behaviour as other considerations change (e.g. the possible introduction of locale coercion and PYTHONUTF8 mode in 3.7) Cheers, Nick. 
--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
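A rough Python 3 sketch of the "build context" idea floated in this message follows. Nothing here is part of PEP 517: the class name `BuildContext`, its constructor, and the in-memory `StringIO` streams are assumptions made for illustration. Only the attribute names `outstream` and `errstream`, the `run()` method, and the "backslashreplace" handler come from the description above.

```python
import io
import subprocess


class BuildContext:
    """Hypothetical frontend-provided object handed to the backend."""

    def __init__(self, config_settings=None):
        self.config_settings = dict(config_settings or {})
        # Text streams: the backend writes str here, never bytes.
        self.outstream = io.StringIO()
        self.errstream = io.StringIO()

    def run(self, cmd, encoding="utf-8"):
        """Run a subprocess and route its output through the text streams.

        Undecodable bytes are kept visible as \\xNN escapes by the
        "backslashreplace" error handler instead of raising.
        """
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.outstream.write(proc.stdout.decode(encoding, errors="backslashreplace"))
        self.errstream.write(proc.stderr.decode(encoding, errors="backslashreplace"))
        return proc.returncode
```

With an object like this the backend deals only in text, and how the collected text reaches the user (JSON over a pipe, a log file, the terminal) stays a frontend decision, which is the property the message argues for.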
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 13:41, Thomas Kluyver wrote:
> Can I take a quick poll of what people following this topic think?
>
> Q1: Default encoding for captured build stdout/stderr
> a. UTF-8 (consistent, can represent any character)
> b. Locale default (convenient if backend runs subprocesses which produce
> output in the locale encoding)
>
> Q2: Handling unknown encodings from subprocesses
> a. Backend should ensure all output is valid in the target encoding
> (Q1), though it may not be accurate.
> b. Unknown output may be passed on as bytes without transcoding, so the
> frontend can e.g. dump it to a file.
>
> I'm currently 1:a, 2:?a .

You probably know this, but I'm 1: b, 2: mild preference for a, but not too bothered. If the answer to 1 is a, though, I strongly prefer 2: a.

Paul
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Tue, May 23, 2017, at 12:56 PM, Paul Moore wrote: > So based on your proposal, won't you introduce similar bugs by using > print() without sorting out encodings? Unless (see below) you assume > that the frontend sorts it out for you. If you strictly follow the locale encoding, you need to sort it out in Python anyway, in case the stdout encoding has been overridden by PYTHONIOENCODING, or PYTHONSTARTUP, or the infernal .pth files. I accept that those are corner cases, though. > Yes, subprocesses that produce a known encoding are trivial to deal > with. But remembering that you *need* to deal with them less so. My > concern here is the same one as you quote above - assuming that > subprocess returns UTF-8 encoded bytes, because it does on Linux and > Mac. I agree, that is a concern. > But if you genuinely don't know (or worse, know there is no consistent > encoding) I'm not sure I see how passing unknown bytes onto the > frontend, which by necessity has less context to guess what those > bytes might mean, is the right answer. The frontend is better able to > know what it wants to *do* with those bytes, but "convert them to text > for the user to see" is the only real answer here IMO (sure, dumping > the raw bytes to a file might be an option, but I imagine it'll be a > relatively uncommon choice). I was indeed thinking of dumping them to a file. It's not very user friendly, but it means the information is there if you need it. I suspect that regardless of the locale, technical information like code and filesystem paths will often contain enough ASCII that a human can interpret them even if non-ASCII characters are wrongly encoded. So I hope that needing to reverse-engineer the encoding will be relatively rare. The appeal of this is that it follows "in the face of ambiguity, refuse the temptation to guess". 
If the backend guesses the encoding incorrectly, the frontend gets valid UTF-8, but is no better able to display it meaningfully, and you then need to go through decode-encode-decode to recover the original text, even if no data was lost. Another option: if the backend runs a subprocess with unknown output encoding, it redirects that output to a temp file and prints the path in its own output. Then there's a better chance that the unknown encoding is at least consistent within the file, so tools can do encoding detection on it. > At the end of the day, there is no perfect answer here. Someone is > going to have to make a judgement call, and as the PEP author, I guess > that's you. So at this point I'll stop badgering you and leave it up > to you to decide what the consensus is. Thanks for listening to my > points, though. I know what I think, but I don't feel like there's a consensus as yet. Can I take a quick poll of what people following this topic think? Q1: Default encoding for captured build stdout/stderr a. UTF-8 (consistent, can represent any character) b. Locale default (convenient if backend runs subprocesses which produce output in the locale encoding) Q2: Handling unknown encodings from subprocesses a. Backend should ensure all output is valid in the target encoding (Q1), though it may not be accurate. b. Unknown output may be passed on as bytes without transcoding, so the frontend can e.g. dump it to a file. I'm currently 1:a, 2:?a . Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
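The temp-file option described above can be sketched roughly as follows. This is only an illustration, not anything the PEP specifies: the function name `run_with_raw_log`, the file prefix, and the stand-in "tool" command are all hypothetical.

```python
import subprocess
import sys
import tempfile


def run_with_raw_log(cmd):
    """Run cmd, keep its raw output bytes in a temp file, return (returncode, path)."""
    # delete=False: the file must outlive this call so the user can inspect it.
    with tempfile.NamedTemporaryFile(
        prefix="build-output-", suffix=".log", delete=False
    ) as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    # Only the path goes into the backend's own (known-encoding) output.
    print("full output in", log.name)
    return proc.returncode, log.name


# Stand-in for a build tool that writes bytes in an unknown encoding.
tool = [sys.executable, "-c", r"import sys; sys.stdout.buffer.write(b'caf\xe9\n')"]
returncode, log_path = run_with_raw_log(tool)
```

Because the bytes are never decoded, nothing is lost, and an encoding-detection tool can later be pointed at the file as a whole.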
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 12:36, Thomas Kluyver wrote: > As you described earlier, though, even using a locale dependent encoding > doesn't really avoid this problem, because of tools using OEM vs ANSI > codepages on Windows. And if PYTHONIOENCODING is set, Python processes > will use that over the locale encoding. I think we're ultimately better > off specifying a consistent encoding rather than trying to guess about > it. Agreed it doesn't avoid the problem. But it does minimise it. I don't see any huge advantage in having a consistent encoding across platforms though - having a consistent *rule*, yes, but "use the locale encoding" is such a rule as well. > I'm also thinking of all the bugs I've seen (and written) by assuming > open() in text mode defaults to UTF-8 encoding - as it does on the Linux > and Mac computers many open source developers use, but not on Windows, > nor in all Linux configurations. So based on your proposal, won't you introduce similar bugs by using print() without sorting out encodings? Unless (see below) you assume that the frontend sorts it out for you. > So I'd recommend that backends running processes for which they know the > encoding should transcode it to UTF-8. I expect we can make standard > utility functions to wait for a subprocess to finish while reading, > transcoding, and repeating its output. Yes, subprocesses that produce a known encoding are trivial to deal with. But remembering that you *need* to deal with them less so. My concern here is the same one as you quote above - assuming that subprocess returns UTF-8 encoded bytes, because it does on Linux and Mac. > I'm still not sure what the backend should do when it runs something for > which it doesn't know the output encoding. 
The possibilities are either: > > - Take a best guess and transcode it to UTF-8, which may risk losing > some information, but keeps the output as valid UTF-8 > - Pass through the raw bytes, ensuring that no information is lost, but > leaving it up to the frontend/user to deal with that. There's never a good answer here. The "correct" answer is to do research and establish what encoding the tool uses, but that's often stupidly difficult. But if you genuinely don't know (or worse, know there is no consistent encoding) I'm not sure I see how passing unknown bytes onto the frontend, which by necessity has less context to guess what those bytes might mean, is the right answer. The frontend is better able to know what it wants to *do* with those bytes, but "convert them to text for the user to see" is the only real answer here IMO (sure, dumping the raw bytes to a file might be an option, but I imagine it'll be a relatively uncommon choice). At the end of the day, there is no perfect answer here. Someone is going to have to make a judgement call, and as the PEP author, I guess that's you. So at this point I'll stop badgering you and leave it up to you to decide what the consensus is. Thanks for listening to my points, though. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Tue, May 23, 2017, at 11:04 AM, Paul Moore wrote: > However, if we do this then we have a situation where existing build > tools (compilers, etc) that we have to support still use platform > dependent encodings. That's a reality that we can't wish away. And the > majority of real-life issues reported on pip are with compilation > errors. So do we require backends that run these tools to ensure that > they transcode the output, or do we risk significant output > corruption, because (essentially) every high-bit character in the > compiler output will be replaced as it's invalid UTF-8? As you described earlier, though, even using a locale dependent encoding doesn't really avoid this problem, because of tools using OEM vs ANSI codepages on Windows. And if PYTHONIOENCODING is set, Python processes will use that over the locale encoding. I think we're ultimately better off specifying a consistent encoding rather than trying to guess about it. I'm also thinking of all the bugs I've seen (and written) by assuming open() in text mode defaults to UTF-8 encoding - as it does on the Linux and Mac computers many open source developers use, but not on Windows, nor in all Linux configurations. So I'd recommend that backends running processes for which they know the encoding should transcode it to UTF-8. I expect we can make standard utility functions to wait for a subprocess to finish while reading, transcoding, and repeating its output. I'm still not sure what the backend should do when it runs something for which it doesn't know the output encoding. The possibilities are either: - Take a best guess and transcode it to UTF-8, which may risk losing some information, but keeps the output as valid UTF-8 - Pass through the raw bytes, ensuring that no information is lost, but leaving it up to the frontend/user to deal with that. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
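A utility of the kind mentioned above (read a subprocess's output, transcode it, and repeat it) might look something like this sketch. The name `run_and_transcode` and its parameters are hypothetical, and it assumes the backend knows the tool's encoding and that the encoding is ASCII-compatible enough for line splitting on bytes to be safe.

```python
import io
import subprocess
import sys


def run_and_transcode(cmd, source_encoding, out=None):
    """Run cmd and repeat its output on `out` as UTF-8 bytes, line by line."""
    out = out if out is not None else sys.stdout.buffer
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    ) as proc:
        for raw_line in proc.stdout:
            # errors="replace" keeps the result valid even if the tool misbehaves.
            text = raw_line.decode(source_encoding, errors="replace")
            out.write(text.encode("utf-8"))
            out.flush()
    # Leaving the with-block waits for the child, so returncode is set here.
    return proc.returncode


# Stand-in for a compiler that writes its messages in cp1252.
tool = [
    sys.executable,
    "-c",
    r"import sys; sys.stdout.buffer.write('Datei nicht gefunden: f\u00fcr.c\n'.encode('cp1252'))",
]
captured = io.BytesIO()
status = run_and_transcode(tool, "cp1252", out=captured)
```

Passing an explicit `out` object is only there to make the helper easy to test; a real backend would presumably write to its own stdout.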
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 09:56, Thomas Kluyver wrote: > I may have missed it, but has anyone proposed what it should do if it > wants to send characters which can't be encoded in the locale encoding? No, it's not been mentioned - the focus has been on running build tools like a compiler. Best answer I can give is to use a (backslash)replace error handler. I agree this is suboptimal, but see below. > Paths on Windows are handled natively as UTF-16, as I understand it, so > it's entirely possible for them to contain characters which can't be > represented in, say, CP1252. Agreed. In practice, the vast bulk of the issues reported for pip seem to be to do with filename characters or localised messages using the ANSI/OEM codepages, though. But I agree that in theory this is an issue. > Given this, and the workarounds Nick has pointed out are necessary for > systems where the locale thinks it's ASCII, I still think that > specifying "UTF-8" is a better option than trying to work with locale > encodings. We're building a new spec for new tools in 2017, let's not > prolong the pain of platform-dependent default encodings further. However, if we do this then we have a situation where existing build tools (compilers, etc) that we have to support still use platform dependent encodings. That's a reality that we can't wish away. And the majority of real-life issues reported on pip are with compilation errors. So do we require backends that run these tools to ensure that they transcode the output, or do we risk significant output corruption, because (essentially) every high-bit character in the compiler output will be replaced as it's invalid UTF-8? I agree 100% that UTF-8 is in theory the right thing. My focus is on the practical aspects of minimising the risks of repeating the sorts of actual issues that we have seen in the past on pip, though, and "don't require backends that run compilers to transcode the output" seems to me to be the most likely route to achieve that. 
Having said that, I won't be the one writing those backends - if people like Steve are OK with transcoding (or dealing with pip issues saying "I can't read the compiler output" being passed back to them as backend issues) then I'm not going to argue against UTF-8. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
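To make the "(backslash)replace" suggestion above concrete, here is a small sketch comparing the two standard error handlers when a path cannot be represented in the locale encoding. The path and the choice of cp1252 are just example values.

```python
# A Windows path with a Cyrillic directory name (example value only).
path = "C:\\src\\проект\\setup.py"

# 'replace' substitutes '?' for each unencodable character, losing it for good.
replaced = path.encode("cp1252", errors="replace")

# 'backslashreplace' writes \uXXXX escapes instead, so a reader (or a tool)
# can still work out which characters were meant.
escaped = path.encode("cp1252", errors="backslashreplace")
```

Neither result is pleasant to read, which is the suboptimal part, but only the second one keeps the information.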
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Tue, May 23, 2017, at 09:08 AM, Paul Moore wrote: > I strongly > prefer using the locale encoding as the assumed encoding for the > output stream rather than UTF-8. I may have missed it, but has anyone proposed what it should do if it wants to send characters which can't be encoded in the locale encoding? Paths on Windows are handled natively as UTF-16, as I understand it, so it's entirely possible for them to contain characters which can't be represented in, say, CP1252. Given this, and the workarounds Nick has pointed out are necessary for systems where the locale thinks it's ASCII, I still think that specifying "UTF-8" is a better option than trying to work with locale encodings. We're building a new spec for new tools in 2017, let's not prolong the pain of platform-dependent default encodings further. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 05:11, Nick Coghlan wrote: > What we can then also do is to recommend that *front-ends* do the > following when invoking their build backend CLI shims: > > 1. Implement the C locale -> UTF-8 based locale coercion defined in > PEP 538 when launching the subprocess > 2. Implement a similar coercion for Windows, where cp1252 being active > in the parent process prompts a call to "'chcp cp65001'" inside the > subprocess before the build backend itself actually starts running I'm a fairly strong -1 on doing "chcp 65001" on Windows. It puts the backend into a position of running under a relatively non-standard environment, and therefore runs the risk of provoking issues. If a build tool has issues as a result of the changed codepage, who's responsible for dealing with the bug? The backend, that manages the tool, or the frontend, that set the codepage? One of the big issues we have is that very few people have expertise in this area (encodings on Windows) and so keeping the environment as "standard" as we can ensures that we can make best use of the limited expertise we have. I agree with Thomas that we're probably reaching a point of diminishing returns. But on one point I remain in dispute - I strongly prefer using the locale encoding as the assumed encoding for the output stream rather than UTF-8. Also (although this is a quality of implementation issue) I think that the frontend (i.e.the shim) should *not* make any changes to the global environment that the backend runs in. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 23 May 2017 at 03:38, Steve Dower wrote:
> Okay, I think I get the problem now. We expect backends to let child subprocesses just spit out whatever *they* want onto the same stdout/stderr.
>
> I'm really not a fan of forcing front ends to clean up that mess, and so I'd still suggest that the backend "tool" be a script to launch the actual tool and do the conversion to UTF-8.

One of the key premises of PEP 517 is that there will be relatively few front ends (pip, possibly easy_install, ???), but a relatively large number of backends (one per build system - at least distutils/setuptools, distutils2, flit, enscons, likely eventually meson, waf, and yotta, and potentially even C/C++ build systems like autotools, CMake, etc).

So it makes sense to put the implementation burden for important aspects of the UX on the part that PyPA has the most influence over (the front-end), rather than considering it reasonable for front-end developers to point fingers and say "That UX failure in the tool we provide isn't *our* fault, it's the fault of the build backend developers for not complying with the interoperability specification properly".

Once we make that core assumption about where the responsibility for the end user experience resides, then the absolute *minimum* behavioural requirements that can be placed on build backends are:

- respect the locale encoding
- emit informational messages on stdout
- emit error messages on stderr

What we can then also do is to recommend that *front-ends* do the following when invoking their build backend CLI shims:

1. Implement the C locale -> UTF-8 based locale coercion defined in PEP 538 when launching the subprocess
2. Implement a similar coercion for Windows, where cp1252 being active in the parent process prompts a call to "chcp 65001" inside the subprocess before the build backend itself actually starts running

That leaves build backend authors with the freedom to assume that they *don't* need to worry about stream encoding issues, since giving them access to properly configured streams is the front end's responsibility.

> Perhaps the middle ground is to specify encoding='utf-8', errors='anything but strict' for front-ends, and well-behaved backends should do the work to transcode when it is known to be necessary for the tools they run. (i.e. frontends do not crash, backends have a simple rule for avoiding loss of data).

In PEP 517's architecture, the front-end developers are also responsible for the CLI that's running inside the backend subprocess.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Mon, May 22, 2017, at 11:36 PM, Steve Dower wrote: > IMHO, #2 is definitely the right way to go. Yes, the platform specific > code now has to worry about the encoding, but... the encoding is > platform specific? So... that seems exactly right? :) Maybe I'm still > missing something here, but I'm totally happy to leave it to Thomas to > decide (which I think he has, but I haven't gotten to looking at that PR > yet). I think I broadly agree with this as well. My reservation is that the build backend might be running a subprocess which produces output in an *unknown* encoding, especially if it allows the package author or the end user to configure a command to run. If it doesn't know the encoding, I'd rather get the raw bytes from the subprocess in the log (e.g. dumped to a file), rather than attempting to transcode them to UTF-8 - the conversion risks losing information, and even if it doesn't, it makes it harder to work out what was really meant. I feel like we're spending a lot of energy on a point that's not really central to the PEP, though. I think we've established that there's a potential for bugs and mojibake whatever we put in the spec. So I'd like to put something relatively simple and move on. I still stand by my PR, which amounts to "backends try to make it UTF-8, frontends don't crash if it isn't". I might be persuaded to add a recommendation that frontends dump the bytes to a file if they're not UTF-8, so the user can pull it apart if necessary. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
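The frontend half of "backends try to make it UTF-8, frontends don't crash if it isn't", together with the possible recommendation to dump non-UTF-8 bytes to a file, could be sketched along these lines. The function name `present_backend_output` and the file naming are hypothetical.

```python
import tempfile


def present_backend_output(raw):
    """Return (text, dump_path); dump_path is None when raw is valid UTF-8."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError:
        # Keep the untouched bytes so the user can pull them apart later.
        with tempfile.NamedTemporaryFile(
            prefix="backend-output-", suffix=".bin", delete=False
        ) as dump:
            dump.write(raw)
        # Still show something readable rather than crashing.
        return raw.decode("utf-8", errors="replace"), dump.name


good_text, good_dump = present_backend_output("résumé".encode("utf-8"))
bad_text, bad_dump = present_backend_output("résumé".encode("cp1252"))
```

In the second call the displayed text contains U+FFFD replacement characters, while the dump file still holds the original cp1252 bytes.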
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22May2017 1253, Paul Moore wrote:
> It seems to me there are 2 schools of thought:
>
> 1. There are likely to be fewer front ends than back ends, and so the front end(s) (basically, pip) should deal with the problem. Also, backends are more likely to be written by developers who are looking at very specific scenarios, and asking them to handle all the complexities of robust multilingual coding is raising the bar on writing a backend too high.
>
> 2. The backend is where the problem lies, and so the backend should address the issue. Furthermore, a well-established principle in dealing with encodings is to convert to strings right at the boundary of the application, and in this case the backend is the only code that has access to that boundary.
>
> (I tend towards (2), but I honestly can't say to what extent that's because it makes it "someone else's problem" for me ;-))

I also tend towards 2, and I assume I am one of the more likely people to write the part that invokes Microsoft's cl.exe/link.exe :)

Is the front end going to be directly invoking those tools? I would assume not, otherwise it won't be cross platform. Since the shim belongs to the front end, I've essentially been ignoring it. The shim can invoke another part of the build tool, but that is not going to be cl.exe/link.exe either.

At some point there will be a script that runs the tools directly. I have been referring to that as the backend, and it is the part that should handle capturing and transcoding the output. Everything from there can be utf8:replace to prevent crashing, but we can't say "the frontend can handle all encodings", and shouldn't say "the frontend will only use bad encodings".

IMHO, #2 is definitely the right way to go. Yes, the platform specific code now has to worry about the encoding, but... the encoding is platform specific? So... that seems exactly right? :)

Maybe I'm still missing something here, but I'm totally happy to leave it to Thomas to decide (which I think he has, but I haven't gotten to looking at that PR yet).

Cheers,
Steve
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 18:38, Steve Dower wrote: > Okay, I think I get the problem now. We expect backends to let child > subprocesses just spit out whatever *they* want onto the same stdout/stderr. s/expect/allow/ The paranoid in me suspects "expect" is also true, though :-) > I'm really not a fan of forcing front ends to clean up that mess, and so I'd > still suggest that the backend "tool" be a script to launch the actual tool > and do the conversion to UTF-8. What you're referring to as the backend "tool" being a script, is what the PEP refers to as a "shim" (as Nick pointed out to me) and is considered part of the front end. The back end is a set of Python APIs which are called by the front end (in any real life front end, via the front end's shim script). > Perhaps the middle ground is to specify encoding='utf-8', errors='anything > but strict' for front-ends, and well-behaved backends should do the work to > transcode when it is known to be necessary for the tools they run. (i.e. > frontends do not crash, backends have a simple rule for avoiding loss of > data). For front ends, "never crash" is essential. But "produce as readable as possible data" is also a high priority. Consider for example a Russian user with a series of directories named in Russian. If the tools write an error using his local 8-bit encoding, and the front end assumes UTF-8, then all of the high-bit characters in his directory names would be replaced. Deciphering an error message like "File ???/?/?.c: unexpected EOF" is problematic... :-( The model assumes that most front-ends would call the backend via a subprocess "shim" that was maintained by the front end project. But the expectation here seems to be that the backend is allowed to write directly to the stdio streams of its process (or at least, to let the tools it calls do so). So the shim *cannot* control the encoding of the data received by the frontend, and so the encoding has to be agreed between backend and frontend. 
The basic question is how the responsibility for dealing with data in an uncertain encoding is allocated. It seems to me there are 2 schools of thought: 1. There are likely to be fewer front ends than back ends, and so the front end(s) (basically, pip) should deal with the problem. Also, backends are more likely to be written by developers who are looking at very specific scenarios, and asking them to handle all the complexities of robust multilingual coding is raising the bar on writing a backend too high. 2. The backend is where the problem lies, and so the backend should address the issue. Furthermore, a well-established principle in dealing with encodings is to convert to strings right at the boundary of the application, and in this case the backend is the only code that has access to that boundary. (I tend towards (2), but I honestly can't say to what extent that's because it makes it "someone else's problem" for me ;-)) As you say, the middle ground here is that front ends must never crash, and back ends should (but aren't required to) produce output in a specified encoding (I still prefer the locale encoding as that has the best chance of avoiding the / issue). That's more or less what pip has to deal with now (and not that far off (1)), and my current attempt to address that situation is at https://github.com/pypa/pip/pull/4486 for what it's worth. A couple of final thoughts. I would expect that testing the handling of encodings is likely to be an important issue (at least, I expect there'll be bugs, and adding tests to make sure they get properly fixed will be important). Handling tool output encoding in the backend is likely to involve relatively low level interface functions, where the inputs and outputs can be relatively easily mocked. So I would expect backend unit testing of encoding handling would be relatively straightforward. 
Conversely, testing front end handling of encoding issues is very tricky - it's necessary to set up system state to persuade the build tools to produce the data you want to test against (it feels like integration testing rather than unit testing). Also, fixing encoding issues in the backend decouples the fix from pip's release cycle, which is probably a good thing (unless the backend is not well maintained, but that's an issue in itself). Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
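The Russian-directory scenario above is easy to reproduce in isolation, which also illustrates why backend-side handling is simple to unit test. In this sketch the message text and the choice of cp1251 are example values, not taken from a real bug report.

```python
# What a tool might write on a Russian Windows system (8-bit locale encoding).
message = "File проект/а/б.c: unexpected EOF"
raw = message.encode("cp1251")

# Frontend assumes UTF-8: every Cyrillic byte is invalid and gets replaced.
seen_by_frontend = raw.decode("utf-8", errors="replace")

# Backend transcodes with the encoding it knows the tool used: nothing is lost.
transcoded = raw.decode("cp1251").encode("utf-8")
recovered = transcoded.decode("utf-8")
```

The first result is the unreadable kind of error message described above; the second round-trips back to the original text.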
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22May2017 0803, Paul Moore wrote:
> On 22 May 2017 at 15:23, Nick Coghlan wrote:
>> No, that's discussed here:
>> https://www.python.org/dev/peps/pep-0517/#comparison-to-competing-proposals
>>
>> Even though PEP 517 defines a Python API for build backends to implement, it still expects installation tools to wrap a subprocess call around the backend invocation.
>
> OK, but is it not acceptable for the child cmdline process (owned by pip) to capture the backend implementation's stdout using reassignment of sys.stdout? I assume, from your response, that it's *not* acceptable to do that - but that needs to be documented somewhere. Specifically, that the child cmdline is not allowed to do something like:
>
>     out = io.StringIO()
>     sys.stdout = out
>     build_backend.hook()
>     sys.__stdout__.buffer.write(out.getvalue().encode("UTF-8"))
>
> (Which would otherwise be a very simple way to get guaranteed UTF-8 as the encoding across the process boundary - but it does so by imposing basically the rules I stated on the backend).

Okay, I think I get the problem now. We expect backends to let child subprocesses just spit out whatever *they* want onto the same stdout/stderr.

I'm really not a fan of forcing front ends to clean up that mess, and so I'd still suggest that the backend "tool" be a script to launch the actual tool and do the conversion to UTF-8.

Perhaps the middle ground is to specify encoding='utf-8', errors='anything but strict' for front-ends, and well-behaved backends should do the work to transcode when it is known to be necessary for the tools they run. (i.e. frontends do not crash, backends have a simple rule for avoiding loss of data).

Cheers,
Steve
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 15:23, Nick Coghlan wrote:
> No, that's discussed here:
> https://www.python.org/dev/peps/pep-0517/#comparison-to-competing-proposals
>
> Even though PEP 517 defines a Python API for build backends to implement, it still expects installation tools to wrap a subprocess call around the backend invocation.

OK, but is it not acceptable for the child cmdline process (owned by pip) to capture the backend implementation's stdout using reassignment of sys.stdout? I assume, from your response, that it's *not* acceptable to do that - but that needs to be documented somewhere. Specifically, that the child cmdline is not allowed to do something like:

    out = io.StringIO()
    sys.stdout = out
    build_backend.hook()
    sys.__stdout__.buffer.write(out.getvalue().encode("UTF-8"))

(Which would otherwise be a very simple way to get guaranteed UTF-8 as the encoding across the process boundary - but it does so by imposing basically the rules I stated on the backend).

> That said, the whole "The build backend still runs in a subprocess" aspect should probably be separated out into its own section "Isolating build backends from frontend process state", rather than solely being covered in the "Comparison to PEP 516?" section, as it's a key aspect of the design - we expect each installation tool to provide its own CLI shim for calling build backends, rather than requiring all installation tools to use the same one.

Strong +1. And that section needs to be very clear on issues like this, covering what the shim is allowed to do. As the point of the shim is to protect the backend from frontend state, I'm OK with the general principle that the shim must do "as little as possible" before calling the hook - but "reset sys.stdout to protect against encoding errors" could easily be seen as within the realm of acceptable behaviour (as it stops hooks writing arbitrary Unicode to a standard output that the shim knows is limited).

I'm happy enough with the idea that pip won't do anything silly in its CLI shim, but we don't want to get into the "implementation as the standard" situation where a backend is allowed to do anything that pip's shim can cope with...

Paul
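For readers wondering why reassigning sys.stdout in the shim is not sufficient by itself, the sketch below shows the limitation. `hook` is a hypothetical stand-in for a backend hook; the point is that a child process launched by the hook inherits the real file descriptor and bypasses the replaced Python-level stream.

```python
import contextlib
import io
import subprocess
import sys


def hook():
    """Hypothetical backend hook: prints text and runs a child tool."""
    print("building wheel for проект")
    # The child inherits the process's real stdout, not the replaced sys.stdout.
    subprocess.run(
        [sys.executable, "-c", "print('output from child tool')"], check=True
    )


captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    hook()

# Text written through sys.stdout was captured and can be re-emitted as UTF-8,
# but the child's line never passed through `captured`.
utf8_bytes = captured.getvalue().encode("utf-8")
```

So this approach only yields guaranteed UTF-8 across the process boundary if backends also capture and relay the output of everything they run.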
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 23:15, Paul Moore wrote:
> On 22 May 2017 at 12:28, Thomas Kluyver wrote:
>> What if it wants to send a character which can't be encoded in the locale encoding? It's quite easy on Windows to end up with a character that you can't encode as cp1252. If the build tool uses .encode(loc_enc, 'replace'), then you've lost information even before it gets to the install tool.
>>
>> It's 2017, I really don't want to go down the 'locale specified encoding' route again. UTF-8 everywhere!
>
> Hang on. Can we take a step back here? I just re-read the PEP and remembered (!) that hooks are *in-process* Python entry points (I've been working with pip's current backend-as-subprocess model, and mixed up in my mind the original 2 proposals here). I think this encoding debate may be a red herring.

No, that's discussed here:
https://www.python.org/dev/peps/pep-0517/#comparison-to-competing-proposals

Even though PEP 517 defines a Python API for build backends to implement, it still expects installation tools to wrap a subprocess call around the backend invocation.

Frontends need to do that in order to protect *their own* process state from bugs and design quirks in backend implementations:

- no monkeypatching of parent process modules
- no changes to the standard stream configuration
- no persistent locale changes
- no environment variable changes
- no manipulation of any other process global state
- calling sys.exit() won't cryptically crash the entire installation process
- memory leaks won't cryptically crash the entire installation process
- infinite loops won't *necessarily* crash the entire installation process (if the build has a timeout on it)
- installation tools running with elevated privileges can readily run the build process with reduced privileges
- installation tools can also readily run the build process in a chroot or containerised environment

And in the context of this thread, it gives the frontend complete control over the stream output from not only the backend itself, but any child processes that it launches.

That said, the whole "The build backend still runs in a subprocess" aspect should probably be separated out into its own section "Isolating build backends from frontend process state", rather than solely being covered in the "Comparison to PEP 516?" section, as it's a key aspect of the design - we expect each installation tool to provide its own CLI shim for calling build backends, rather than requiring all installation tools to use the same one.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
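As a rough illustration of the isolation argument above, a frontend wrapping the backend call in a subprocess might look like this sketch. The `SHIM` string is a hypothetical stand-in for a real CLI shim, and the 60 second timeout is an arbitrary example value.

```python
import subprocess
import sys

# Hypothetical stand-in for a shim that calls a backend hook; here the
# "backend" misbehaves by calling sys.exit() with a non-zero status.
SHIM = "import sys; print('hook ran'); sys.exit(3)"

result = subprocess.run(
    [sys.executable, "-c", SHIM],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # the frontend sees everything the child emits
    timeout=60,  # an infinite loop in the backend cannot hang the frontend forever
)

# The frontend process is unaffected and simply inspects the outcome.
log = result.stdout.decode("utf-8", errors="replace")
exit_status = result.returncode
```

Because the capture happens at the pipe level, it also covers output from any child processes the backend launches, which is the property relevant to the encoding debate.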
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 12:28, Thomas Kluyver wrote: > What if it wants to send a character which can't be encoded in the > locale encoding? It's quite easy on Windows to end up with a character > that you can't encode as cp1252. If the build tool uses .encode(loc_enc, > 'replace'), then you've lost information even before it gets to the > install tool. > > It's 2017, I really don't want to go down the 'locale specified > encoding' route again. UTF-8 everywhere! Hang on. Can we take a step back here? I just re-read the PEP and remembered (!) that hooks are *in-process* Python entry points (I've been working with pip's current backend-as-subprocess model, and mixed up in my mind the original 2 proposals here). I think this encoding debate may be a red herring. If a hook is being called as a Python method call, then it can print what it likes to stdout and stderr. And it's the backend's responsibility to ensure that it never fails when printing - so the *backend* has to deal with the fact that anything it wants to print must be representable in sys.stdout.encoding, with the default (raise an exception) error handling. Given this fact, and the fact that sys.stdout and sys.stderr are *text* output streams, build frontends like pip can reasonably just replace sys.std{out,err} (for example with a StringIO object) to get hook output. There's no encoding issue for frontends, they just capture the text sent to the stdio streams. The rules needed for *backends* are then: 1. Backends MUST NOT write to raw IO channels, all output MUST go via sys.stdout and sys.stderr. Build frontends MAY redirect these streams to post-process them, but are not required to do so. As a consequence: 1a. Backends MUST be prepared to deal with the possibility that those IO streams have the limitations of the platform IO streams (e.g., limited subset of Unicode allowed, fails with an exception when invalid characters are written). 1b. 
Backends MUST capture and manage the output from any subprocesses they spawn (so that they can follow the other rules). 1c. Backends cannot assume that they can write output that the user will see - frontends may suppress or modify any output passed on stdout. Conversely, backends should not bypass the ability of frontends to capture stdout, as frontends are responsible for user interaction. Some of those MUSTs could be replaced by SHOULD, if we want to allow backends to write directly to the screen. But that is likely to corrupt the UI of the frontend, so I'm inclined to say that we don't allow that. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
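[Editor's sketch] Under the in-process reading that Paul describes in this message, a frontend could capture hook output roughly as follows. `call_hook_captured` and `fake_build_wheel` are hypothetical names invented for this example. The sketch only captures text written through `sys.stdout` and `sys.stderr`, which is why rule 1b asks backends to capture subprocess output themselves.

```python
import contextlib
import io
import sys


def call_hook_captured(hook, *args, **kwargs):
    """Call a backend hook in-process, collecting everything it prints."""
    buffer = io.StringIO()
    # Both streams are text streams, so the frontend receives str objects
    # and never has to guess an encoding.
    with contextlib.redirect_stdout(buffer), contextlib.redirect_stderr(buffer):
        result = hook(*args, **kwargs)
    return result, buffer.getvalue()


def fake_build_wheel(wheel_directory):
    """Hypothetical stand-in for a backend hook."""
    print("compiling caf\u00e9 extension")
    print("warning: something odd", file=sys.stderr)
    return "demo-1.0-py3-none-any.whl"


result, log = call_hook_captured(fake_build_wheel, "dist")
```

Anything written straight to the underlying file descriptors, for example by a compiler launched without output capture, bypasses the `StringIO` buffer entirely.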
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 21:28, Thomas Kluyver wrote: > On Mon, May 22, 2017, at 12:02 PM, Paul Moore wrote: >> The only reservation I have is that the choice of UTF-8 means that on >> Windows, build backends pretty much have to explicitly manage tool >> output (as they are pretty much certain *not* to output in UTF-8). >> Build backend writers that aren't aware of this issue (most likely >> because their main platform is not Windows) could very easily choose >> to just pass through the raw bytes, and as a result *all* non-ASCII >> output would be garbled on non-UTF-8 systems. >> >> Would locale.getpreferredencoding() not be a better choice here? I >> know it has issues in some situations on Unix, but are they worse than >> the issues UTF-8 would cause on Windows? After all it's the encoding >> used by subprocess.Popen in "universal newlines" mode... > > What if it wants to send a character which can't be encoded in the > locale encoding? It's quite easy on Windows to end up with a character > that you can't encode as cp1252. If the build tool uses .encode(loc_enc, > 'replace'), then you've lost information even before it gets to the > install tool. The counterargument is that there's plenty of text that *can* be correctly encoded in cp1252 (especially in Europe and LATAM) that will be rendered incorrectly if the installation tool attempts to interpret it as UTF-8. CPython itself will also display explicitly UTF-8 encoded text incorrectly on a Windows console in versions prior to 3.6. > It's 2017, I really don't want to go down the 'locale specified > encoding' route again. UTF-8 everywhere! "UTF-8 everywhere" is fine for network services that only need to talk to other network services, command line applications, and web browsers, but even in 2017 it's still a problematic assumption on client devices running Windows or Linux. 
Rather than the locale specified encoding being broken in general, the key recurring problem we've found with it on *nix systems relates to the fact that glibc still defaults to ASCII in the C locale - "assume ASCII really means UTF-8" is enough to solve that problem *without* breaking compatibility with cp1252 and non-UTF-8 universal encodings. The other recurring problem is cp1252 itself on Windows, which suffers from the fact that there isn't a nice environment variable based way to change the active code page when invoking a subprocess, and also that cp65001 (the UTF-8 code page) isn't really properly supported in Python 2.7 (although you can inject a custom search function to alias it to utf-8 [1]). Even in that case though, mandating "thou shalt treat the streams as UTF-8" in the spec doesn't *solve* those problems - it just means we're specifying a behaviour that we know will provide a poor developer experience on Windows, rather than alerting tool developers to the fact that this is something they're going to need to be aware of. Cheers, Nick. [1] http://neurocline.github.io/dev/2016/10/13/python-utf8-windows.html -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 21:02, Paul Moore wrote: > On 22 May 2017 at 11:22, Thomas Kluyver wrote: >> I have made a PR against the PEP with my best take on the encoding >> situation: >> https://github.com/python/peps/pull/264/files > > LGTM. > > The only reservation I have is that the choice of UTF-8 means that on > Windows, build backends pretty much have to explicitly manage tool > output (as they are pretty much certain *not* to output in UTF-8). > Build backend writers that aren't aware of this issue (most likely > because their main platform is not Windows) could very easily choose > to just pass through the raw bytes, and as a result *all* non-ASCII > output would be garbled on non-UTF-8 systems. > > Would locale.getpreferredencoding() not be a better choice here? I > know it has issues in some situations on Unix, but are they worse than > the issues UTF-8 would cause on Windows? After all it's the encoding > used by subprocess.Popen in "universal newlines" mode... +1 from me for locale.getpreferredencoding() as the default - not only is it a more suitable default on Windows, it's also the best way to do the right thing in GB.18030 locales, and as far as I'm aware, handling that correctly is still a requirement for selling commercial software into China (that's why I chose it as the main non-UTF-8 example encoding in PEP 538). If Python tools want to specifically detect the use of 7-bit ASCII and override *that* to be UTF-8, then the relevant snippet is: def get_stream_encoding(): nominal = locale.getpreferredencoding() if codecs.lookup(nominal).name == "ascii": return "utf-8" return nominal That's effectively the same model that PEP 538 and 540 are proposing be applied by default for the standard streams, so it would also interoperate well with Python 3.7+. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
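[Editor's sketch] The snippet in this message was flattened onto one line by the archive. Laid out with the two imports it relies on, it reads as below. The only addition is the final line, which calls the function for illustration.

```python
import codecs
import locale


def get_stream_encoding():
    """Return the locale's preferred encoding, treating 7-bit ASCII as UTF-8."""
    nominal = locale.getpreferredencoding()
    # codecs.lookup() normalises aliases such as "ANSI_X3.4-1968" or
    # "US-ASCII" to the canonical codec name "ascii".
    if codecs.lookup(nominal).name == "ascii":
        return "utf-8"
    return nominal


encoding = get_stream_encoding()
```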
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Mon, May 22, 2017, at 12:02 PM, Paul Moore wrote: > The only reservation I have is that the choice of UTF-8 means that on > Windows, build backends pretty much have to explicitly manage tool > output (as they are pretty much certain *not* to output in UTF-8). > Build backend writers that aren't aware of this issue (most likely > because their main platform is not Windows) could very easily choose > to just pass through the raw bytes, and as a result *all* non-ASCII > output would be garbled on non-UTF-8 systems. > > Would locale.getpreferredencoding() not be a better choice here? I > know it has issues in some situations on Unix, but are they worse than > the issues UTF-8 would cause on Windows? After all it's the encoding > used by subprocess.Popen in "universal newlines" mode... What if it wants to send a character which can't be encoded in the locale encoding? It's quite easy on Windows to end up with a character that you can't encode as cp1252. If the build tool uses .encode(loc_enc, 'replace'), then you've lost information even before it gets to the install tool. It's 2017, I really don't want to go down the 'locale specified encoding' route again. UTF-8 everywhere! One affordance I'd consider is a recommendation to install tools that if captured output is not valid UTF-8, they dump the raw bytes to a file so that no information is lost. I'm not sure if that recommendation needs to be in the spec itself, though. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 11:22, Thomas Kluyver wrote: > I have made a PR against the PEP with my best take on the encoding > situation: > https://github.com/python/peps/pull/264/files LGTM. The only reservation I have is that the choice of UTF-8 means that on Windows, build backends pretty much have to explicitly manage tool output (as they are pretty much certain *not* to output in UTF-8). Build backend writers that aren't aware of this issue (most likely because their main platform is not Windows) could very easily choose to just pass through the raw bytes, and as a result *all* non-ASCII output would be garbled on non-UTF-8 systems. Would locale.getpreferredencoding() not be a better choice here? I know it has issues in some situations on Unix, but are they worse than the issues UTF-8 would cause on Windows? After all it's the encoding used by subprocess.Popen in "universal newlines" mode... Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
I have made a PR against the PEP with my best take on the encoding situation: https://github.com/python/peps/pull/264/files On Mon, May 22, 2017, at 11:19 AM, Paul Moore wrote: > On 22 May 2017 at 10:56, Thomas Kluyver wrote: > > On Sat, May 20, 2017, at 07:36 PM, Steve Dower wrote: > >> Require that build tools either send UTF-8 to the UI component, or write > >> bytes to a file and call it a build output. I see no benefit in > >> requiring both the build tool and the UI tool to guess what the text > >> encoding is. > > > > I'm not proposing that the install tool should try to guess the > > encoding, but I think a well written install tool shouldn't crash if the > > build output doesn't match the encoding it expects. Even if the spec > > says that the build output MUST be UTF-8 encoded, build tools can have > > bugs, and you don't want the install to fail just because the log > > isn't correctly encoded. > > > > Hence, I think a 'SHOULD' is appropriate for this part of the spec: > > > > - To install tool authors, it is clear that they can display the output > > as UTF-8 so long as they don't crash if it's invalid. > > - To build tool authors, it's clear that they can't pass the buck to > > install tool authors if output gets jumbled because it's not UTF-8. > > I'd say that it's not so much just "well written" install tools. I'd > say that install tools MUST NOT crash if build tool output isn't in > the expected encoding. On the other hand, the encoding agreement > implies that if build tools *do* send data in the correct encoding > then they are entitled to expect that it will be displayed accurately > to the end user. > > Output can be garbled in two ways: > > 1. The build tool does not (or cannot) ensure that its output is in > the standard-mandated encoding. > 2. The install tool cannot display the full range of characters > representable in the standard-mandated encoding. > > Neither of these should cause a failure. 
Well written install tools > should warn in the case of (1) - "I have been passed data that I don't > understand, I'll do my best to display it but can't guarantee the > output won't be garbled". In the case of (2), though, that's "as > expected" - if your OS settings mean you can't display certain > characters, you shouldn't be surprised if your install tool replaces > them with a placeholder. > > On an implementation note, this boils down to something like the > following in the install tool: > > # Step 1 > try: > data = decode build output using STD_ENCODING > except UnicodeDecodeError: > warn "Data is not in expected encoding" > data = decode using STD_ENCODING with errors=<replacement> > > # Step 2 > data = data.encode(MY_OUTPUT_ENCODING, errors=<replacement>).decode(MY_OUTPUT_ENCODING) > > # We now have subprocess output that's safe to display if requested. > > As a side note, I find step 2 "sanitise my string to ensure it can be > safely output" to be a pretty common operation - possibly because > Python's standard IO streams raise exceptions on unicode errors - and > I'm surprised there isn't a better way to spell it than the > encode/decode pair above. > > Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 22 May 2017 at 10:56, Thomas Kluyver wrote: > On Sat, May 20, 2017, at 07:36 PM, Steve Dower wrote: >> Require that build tools either send UTF-8 to the UI component, or write >> bytes to a file and call it a build output. I see no benefit in >> requiring both the build tool and the UI tool to guess what the text >> encoding is. > > I'm not proposing that the install tool should try to guess the > encoding, but I think a well written install tool shouldn't crash if the > build output doesn't match the encoding it expects. Even if the spec > says that the build output MUST be UTF-8 encoded, build tools can have > bugs, and you don't want the install to fail just because the log > isn't correctly encoded. > > Hence, I think a 'SHOULD' is appropriate for this part of the spec: > > - To install tool authors, it is clear that they can display the output > as UTF-8 so long as they don't crash if it's invalid. > - To build tool authors, it's clear that they can't pass the buck to > install tool authors if output gets jumbled because it's not UTF-8. I'd say that it's not so much just "well written" install tools. I'd say that install tools MUST NOT crash if build tool output isn't in the expected encoding. On the other hand, the encoding agreement implies that if build tools *do* send data in the correct encoding then they are entitled to expect that it will be displayed accurately to the end user. Output can be garbled in two ways: 1. The build tool does not (or cannot) ensure that its output is in the standard-mandated encoding. 2. The install tool cannot display the full range of characters representable in the standard-mandated encoding. Neither of these should cause a failure. Well written install tools should warn in the case of (1) - "I have been passed data that I don't understand, I'll do my best to display it but can't guarantee the output won't be garbled". 
In the case of (2), though, that's "as expected" - if your OS settings mean you can't display certain characters, you shouldn't be surprised if your install tool replaces them with a placeholder. On an implementation note, this boils down to something like the following in the install tool:
# Step 1
try:
    data = decode build output using STD_ENCODING
except UnicodeDecodeError:
    warn "Data is not in expected encoding"
    data = decode using STD_ENCODING with errors=<replacement>
# Step 2
data = data.encode(MY_OUTPUT_ENCODING, errors=<replacement>).decode(MY_OUTPUT_ENCODING)
# We now have subprocess output that's safe to display if requested.
As a side note, I find step 2 "sanitise my string to ensure it can be safely output" to be a pretty common operation - possibly because Python's standard IO streams raise exceptions on unicode errors - and I'm surprised there isn't a better way to spell it than the encode/decode pair above. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
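[Editor's sketch] The two steps in this message can be written as runnable Python along these lines. The function names, the default of UTF-8 for `STD_ENCODING`, and the choice of the `"replace"` error handler are assumptions; the message leaves the replacement strategy open.

```python
import warnings

STD_ENCODING = "utf-8"  # assumption: the encoding the spec would mandate


def decode_build_output(raw, std_encoding=STD_ENCODING):
    """Step 1: decode build output, warning instead of failing on bad bytes."""
    try:
        return raw.decode(std_encoding)
    except UnicodeDecodeError:
        warnings.warn("Data is not in expected encoding")
        return raw.decode(std_encoding, errors="replace")


def make_displayable(text, output_encoding):
    """Step 2: replace characters the output stream cannot represent."""
    return text.encode(output_encoding, errors="replace").decode(output_encoding)
```

A frontend might call `make_displayable(decode_build_output(raw), sys.stdout.encoding or "ascii")` before printing. Step 1 substitutes U+FFFD for undecodable bytes, and step 2 substitutes "?" for characters the console cannot encode.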
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Sat, May 20, 2017, at 07:36 PM, Steve Dower wrote: > Require that build tools either send UTF-8 to the UI component, or write > bytes to a file and call it a build output. I see no benefit in > requiring both the build tool and the UI tool to guess what the text > encoding is. I'm not proposing that the install tool should try to guess the encoding, but I think a well written install tool shouldn't crash if the build output doesn't match the encoding it expects. Even if the spec says that the build output MUST be UTF-8 encoded, build tools can have bugs, and you don't want the install to fail just because the log isn't correctly encoded. Hence, I think a 'SHOULD' is appropriate for this part of the spec: - To install tool authors, it is clear that they can display the output as UTF-8 so long as they don't crash if it's invalid. - To build tool authors, it's clear that they can't pass the buck to install tool authors if output gets jumbled because it's not UTF-8. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 21 May 2017 at 02:36, Steve Dower wrote: > On 20May2017 0820, Nick Coghlan wrote: >> >> Good point regarding the fact that the Windows 16-bit APIs only come >> into play for interactive sessions (even in 3.6+), while for PEP 517 >> we're specifically interested in the 8-bit pipes used to communicate >> with build subprocesses launched by an installation tool. > > > I need to catch up on the PEP (and thanks Brett for alerting me to the > thread), but this comment in particular cements the mental diagram I have > right now: > > (build UI) <--> (build tool) <--> (compiler) > ( Python ) <--> ( Python ) <--> (anything) > > I'll probably read the PEP closely and see that this is entirely incorrect, > but if it's right: > > * encoding for text between the build UI and build tool should just be > specified once for all platforms (i.e. use UTF-8). > * encoding for text between build tool and the compiler depends on the > compiler Alas, it isn't quite that simple. Let's take the current de facto standard case: (user console/CI build log) <-> pip <-> setup.py (distutils/setuptools) <-> 3rd party tool Key usability feature: * when requested, informational messages from 3rd party tools SHOULD be made available to the end user for debugging purposes Ideal outcome: * everything that makes it to the user console or CI build log is readable by the end user Essential requirement: * encoding problems in informational messages emitted by 3rd party tools MUST NOT cause the build to fail Now, the easiest way to handle the essential requirement as the author of an installation or build tool is to choose not to deal with it: instead, you just treat the output from further downstream as opaque binary data, and let the user console/CI build log layer deal with any encoding problems as they see fit. 
You may end up with some build failures that are a pain to debug because you're getting nonsense from the build pipeline, but you won't fail your build *because* some particular build tool emitted improperly encoded nonsense. That all changes if we *require* UTF-8 on the link between the installation tool (e.g. pip) and the build tool (e.g. setup.py). If we do that: * the installation tool can't just pass along build tool output to the user console or CI build log any more, it has a nominal obligation to try to interpret it as UTF-8 * the build tool (or build tool shim) can't just pass along 3rd party tool output to the installation tool any more, it has a nominal obligation to try to get it to emit UTF-8 Now, *particular* installation and build tools may want to strongly encourage the use of UTF-8 in an effort to get closer to the ideal outcome, but that isn't the key objective of PEP 517: the key objective of PEP 517 is to make it easier to use *general purpose* build systems that happen to be implemented in Python (like waf, scons, and meson) to handle complex build scenarios, while also allowing the use of simpler Python-only build systems (like flit) for distribution of pure Python projects. That said, the PEP *could* explicitly define a short list of behaviours that we consider reasonable in an installation tool: 1. Treat the informational output from the build tool as an opaque binary stream 2. Treat the informational output from the build tool as a text stream encoded using locale.getpreferredencoding(), and decode it using the backslashreplace error handler 3. 
Treat the informational output from the build tool as a UTF-8 encoded text stream, and decode it using the backslashreplace error handler
We'd just need to caveat the latter two options with the fact that they'll give you a cryptic error message on Python 3.4 and earlier (including Python 2):
>>> b"\xf0\x01\x02\x03".decode("utf-8", "backslashreplace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ncoghlan/devel/py27/Lib/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback
I had to look that up on Stack Overflow myself, but what it's trying to say is that until Python 3.5, "backslashreplace" only worked for encoding, not for decoding. That means that for earlier versions, you'd need to define your own custom error handler as described in http://stackoverflow.com/questions/25442954/how-should-i-decode-bytes-using-ascii-without-losing-any-junk-bytes-if-xmlch/25443356#25443356 Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
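[Editor's sketch] A custom decode-side error handler of the kind this message refers to could look like the following. The handler name `backslashreplace_decode` is made up for this example, and the code is not copied from the linked Stack Overflow answer. On Python 3.5 and later the built-in `"backslashreplace"` handler already covers decoding, so this would only matter for older interpreters.

```python
import codecs


def backslashreplace_decode(exc):
    """Error handler that turns undecodable bytes into \\xNN escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # bytearray() yields integers on both Python 2 and Python 3.
    escaped = u"".join(u"\\x{:02x}".format(byte) for byte in bytearray(bad))
    # Return the replacement text and the position at which to resume decoding.
    return escaped, exc.end


codecs.register_error("backslashreplace_decode", backslashreplace_decode)

decoded = b"\xf0\x01\x02\x03".decode("utf-8", "backslashreplace_decode")
```

Registering the handler under its own name avoids shadowing the built-in handler, which is still needed for encoding.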
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
> On May 20, 2017, at 4:05 PM, Paul Moore wrote: > > I'm a little concerned if we're going to end up with a proposal that > means that distutils is in violation of the spec unless this issue is > fixed. I'm not sure if that's where we're headed, but I wanted to be > clear here - is PEP 517 intended to encompass distutils/setuptools, or > are we treating them as a legacy case, that pip should special-case? I don’t think distutils/setuptools are going to be compatible out of the box anyways, because its API is tied to setup.py. Whatever adapter is written to adapt it to PEP 517 can handle any semantic differences as well. — Donald Stufft ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20May2017 1315, Paul Moore wrote:
> On 20 May 2017 at 17:36, Steve Dower wrote:
>> In general, since most subprocesses (at least on Windows) do not have customizable encodings, the tool that launches them needs to know what the encoding is. Since we don't live in a Python 3.6 world quite yet, that means the tool should read raw bytes from the compiler and encode them to UTF-8.
> Did you spot my point that Visual C produces output that's a mixture of OEM and ANSI codepages? [SNIP]
Yes, and it's a perfect example of why the MSVC-specific wrapper should be the one to deal with tool encodings. If you forward unencoded bytes like this back to the UI, it will have to deal with the mixed encoding.
> I'd be very surprised if build tool developers got this sort of edge case correct without at least some guidance from the PEP on the sorts of things they need to consider. You suggest "read raw bytes and encode them to UTF-8" - but you don't encode bytes, you encode strings, so you still need to convert those bytes to a string first, and there's no encoding you can reliably use for this. You need to use "errors=replace" to ensure you can handle inconsistently encoded data without getting an exception.
The "read raw bytes and [transcode] them" comment was meant to be that sort of help. I didn't go as far as writing `output.decode(output_encoding, errors="replace").encode("utf-8", errors="replace")`, but that's basically what I meant to imply. The build tool developer is the *only* developer who can get this right, and if they can't, then they have to figure out the most appropriate way to work around the fact that they can't.
As for defining distutils as incompatible with the PEP, I'm okay with that. Updating distutils to use subprocess for launching tools rather than spawnv would be a very good start (and likely a good contribution for a new contributor), but allowing build tools to continue to be written badly is not worthwhile. 
Cheers, Steve ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
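[Editor's sketch] The transcoding expression Steve quotes in this message can be wrapped in a small helper. The function name `transcode_to_utf8` and the sample bytes are hypothetical; the body is the expression from the message.

```python
def transcode_to_utf8(output, output_encoding):
    """Re-encode tool output from a known source encoding to UTF-8.

    Bytes that are invalid in the source encoding become U+FFFD instead
    of raising, so a stray byte cannot fail the build.
    """
    return output.decode(output_encoding, errors="replace").encode(
        "utf-8", errors="replace"
    )


# Hypothetical tool output in the Windows ANSI code page (0xA3 is the pound sign).
utf8_bytes = transcode_to_utf8(b"a\xa3b", "cp1252")
```

The helper still needs the caller to know `output_encoding`, which is the point of the message: only the code that launches the tool is in a position to supply it.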
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20 May 2017 at 17:36, Steve Dower wrote: > In general, since most subprocesses (at least on Windows) do not have > customizable encodings, the tool that launches them needs to know what the > encoding is. Since we don't live in a Python 3.6 world quite yet, that means > the tool should read raw bytes from the compiler and encode them to UTF-8. Did you spot my point that Visual C produces output that's a mixture of OEM and ANSI codepages? The example I used was: OEM code page 850, ANSI codepage 1252 (standard British English Windows) Visual Studio 2015 cl a£b >output.file The output uses CP850 (in the cl error message) and CP1252 (in the link error) for the £ sign. When run from the command line without redirection, the output is in a consistent encoding. It's only when you redirect the output (I redirected to a file, I assume a pipe would be the same) that you get the problem. I'd be very surprised if build tool developers got this sort of edge case correct without at least some guidance from the PEP on the sorts of things they need to consider. You suggest "read raw bytes and encode them to UTF-8" - but you don't encode bytes, you encode strings, so you still need to convert those bytes to a string first, and there's no encoding you can reliably use for this. You need to use "errors=replace" to ensure you can handle inconsistently encoded data without getting an exception. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
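[Editor's sketch] The mixed code page problem described in this message can be reproduced without Visual Studio. The message texts below are invented placeholders; only the encodings (CP850 for the compiler part, CP1252 for the linker part) come from the message.

```python
pound = "\u00a3"

# Hypothetical stand-ins for the two parts of the redirected output.
cl_message = b"cl: error near a" + pound.encode("cp850") + b"b"      # OEM code page
link_message = b"link: error near a" + pound.encode("cp1252") + b"b"  # ANSI code page
combined = cl_message + b"\n" + link_message

# No single code page decodes both lines correctly.
as_oem = combined.decode("cp850", errors="replace").splitlines()
as_ansi = combined.decode("cp1252", errors="replace").splitlines()

# Strict UTF-8 decoding fails outright, which is why errors="replace"
# (or a similar handler) is needed to avoid an exception.
try:
    combined.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Decoding as CP850 recovers the pound sign only in the first line, and decoding as CP1252 recovers it only in the second.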
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20 May 2017 at 19:36, Steve Dower wrote: > >> - As a lazy developer, I don't want to read stdout/stderr from a >> subprocess only to spit it back to my own stdout/stderr. I'd much rather >> just launch the subprocess and let it use the same stdout/stderr as my >> build tool. > > > One of the open issues against distutils is that it does this. We can allow > it, but a well-defined tool should capture the output and pass it to the UI > component rather than bypassing the UI component. I'm a little concerned if we're going to end up with a proposal that means that distutils is in violation of the spec unless this issue is fixed. I'm not sure if that's where we're headed, but I wanted to be clear here - is PEP 517 intended to encompass distutils/setuptools, or are we treating them as a legacy case, that pip should special-case? Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20May2017 1011, Thomas Kluyver wrote:
> On Sat, May 20, 2017, at 05:36 PM, Steve Dower wrote:
>> In general, since most subprocesses (at least on Windows) do not have customizable encodings, the tool that launches them needs to know what the encoding is. Since we don't live in a Python 3.6 world quite yet, that means the tool should read raw bytes from the compiler and encode them to UTF-8.
> I half agree, but:
> - Build tools may not 100% know what encoding output will be produced, especially if the developer can supply a custom command for the build tool to run.
In this case, the whole thing breaks down anyway. UI can't be expected to reliably display text from an unknown encoding - at some point it has to be forced into a known quantity, and IMHO the code closest to the tool should do it.
> - It's possible for data on a pipe to be binary data with no meaning as text.
Sure, but it cannot be rendered unless you choose an encoding. All you can do is dump it to a file (and let a file editor choose an encoding).
> - As a lazy developer, I don't want to read stdout/stderr from a subprocess only to spit it back to my own stdout/stderr. I'd much rather just launch the subprocess and let it use the same stdout/stderr as my build tool.
One of the open issues against distutils is that it does this. We can allow it, but a well-defined tool should capture the output and pass it to the UI component rather than bypassing the UI component.
> So I think it's most practical to recommend that build tools produce UTF-8 (if not sys.stdout.isatty()), but let build tool developers decide how far they go to comply with that.
Require that build tools either send UTF-8 to the UI component, or write bytes to a file and call it a build output. I see no benefit in requiring both the build tool and the UI tool to guess what the text encoding is.
Cheers, Steve ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Sat, May 20, 2017, at 05:36 PM, Steve Dower wrote: > I'll probably read the PEP closely and see that this is entirely > incorrect, but if it's right: > > * encoding for text between the build UI and build tool should just be > specified once for all platforms (i.e. use UTF-8). +1 > * encoding for text between build tool and the compiler depends on the > compiler > > In general, since most subprocesses (at least on Windows) do not have > customizable encodings, the tool that launches them needs to know what > the encoding is. Since we don't live in a Python 3.6 world quite yet, > that means the tool should read raw bytes from the compiler and encode > them to UTF-8. I half agree, but: - Build tools may not 100% know what encoding output will be produced, especially if the developer can supply a custom command for the build tool to run. - It's possible for data on a pipe to be binary data with no meaning as text. - As a lazy developer, I don't want to read stdout/stderr from a subprocess only to spit it back to my own stdout/stderr. I'd much rather just launch the subprocess and let it use the same stdout/stderr as my build tool. So I think it's most practical to recommend that build tools produce UTF-8 (if not sys.stdout.isatty()), but let build tool developers decide how far they go to comply with that. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20May2017 0820, Nick Coghlan wrote:
> Good point regarding the fact that the Windows 16-bit APIs only come into play for interactive sessions (even in 3.6+), while for PEP 517 we're specifically interested in the 8-bit pipes used to communicate with build subprocesses launched by an installation tool.
I need to catch up on the PEP (and thanks Brett for alerting me to the thread), but this comment in particular cements the mental diagram I have right now:
(build UI) <--> (build tool) <--> (compiler)
( Python ) <--> ( Python ) <--> (anything)
I'll probably read the PEP closely and see that this is entirely incorrect, but if it's right:
* encoding for text between the build UI and build tool should just be specified once for all platforms (i.e. use UTF-8).
* encoding for text between build tool and the compiler depends on the compiler
In general, since most subprocesses (at least on Windows) do not have customizable encodings, the tool that launches them needs to know what the encoding is. Since we don't live in a Python 3.6 world quite yet, that means the tool should read raw bytes from the compiler and encode them to UTF-8.
The encoding between the tool and UI is essentially irrelevant - the UI is going to transform the data anyway for display, and the tool is going to have to transform it from the compilation tools, so the best we can do is pick the most likely encoding to avoid too many operations. UTF-8 is probably that.
That's my 0.02AUD based on a vague memory of the PEP and this thread. As I get time today (at PyCon) to read up on it I may post amendments, but in general I'm +100 on "just pick an encoding and make the implementations transcode".
Cheers, Steve ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Fri, May 19, 2017, 09:20 Thomas Kluyver, wrote: > On Fri, May 19, 2017, at 05:17 PM, Paul Moore wrote: > > On 19 May 2017 at 16:53, Daniel Holth wrote: > > > Congrats on getting 518 in. > > > > Agreed, by the way. That's a big step! > > Thanks both. It does feel like an achievement. :-) > As it should! Thanks for bringing the PEP to life! -brett ___ > Distutils-SIG maillist - Distutils-SIG@python.org > https://mail.python.org/mailman/listinfo/distutils-sig > ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
Good point regarding the fact that the Windows 16-bit APIs only come into play for interactive sessions (even in 3.6+), while for PEP 517 we're specifically interested in the 8-bit pipes used to communicate with build subprocesses launched by an installation tool.

On 20 May 2017 at 19:11, Paul Moore wrote:
> The bigger question, though, is to what extent we want to mandate that
> build tools that run external tools such as compilers take
> responsibility for the encoding of the output of those tools (rather
> than simply passing the output through to the output stream
> unmodified). And if we do want to, whether we want to allow an
> exception for setuptools/distutils.
>
> Also, a question regarding Unix - do we really want to mandate UTF-8
> even if the system locale is set to something else? Won't that mean
> that build tools have the same problem with compilers generating
> output in the encoding the tool wants that we already have on Windows?

Yeah, I think that problem was starting to occur to me, hence the reference to handling RPM and DEB build environments. At least for non-Windows systems, I see two possible recommendations:

1. We advise installation tools to use binary streams to communicate with build tools, and treat the results as opaque binary data. If it needs to be written out to the installation tool's own streams, then use the binary level APIs for those interfaces to inject the build tool output directly, rather than decoding and re-encoding it first.

2. We advise installation tools to adopt a PEP 538 style solution, where they mostly just trust the result of locale.getpreferredencoding() *unless* "codecs.lookup(locale.getpreferredencoding()).name == 'ascii'". In the latter case, we'd advise them to set LC_CTYPE (and potentially LANG) appropriately for the running OS. Regardless of whether or not that locale coercion was needed, we'd recommend setting "replace" or "backslashreplace" when decoding the stream output from the subprocess.

At the specification level, I think option 1 probably makes the most sense - we'd be advising installation tools that they're free to kick any mojibake problems further down the automation pipeline if they don't want to worry about it. It's also the only one of the two recommendations we can readily make cross platform.

At a quality-of-implementation level, there's a lot of potential value in option 2 (at least on non-Windows systems) - we just wouldn't require or recommend it at the level of the interoperability specifications.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
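Option 2 above can be sketched roughly as follows. This is an illustration under stated assumptions, not anything pip implements: the helper names `plan_child_locale` and `decode_build_output` are hypothetical, and "C.UTF-8" is assumed to be an available locale on the target system (PEP 538 tries several candidates).

```python
import codecs
import locale


def plan_child_locale(preferred=None, base_env=None):
    """Return (env, encoding) to use when launching a build subprocess.

    Trust the preferred locale encoding unless it resolves to ASCII, in
    which case coerce LC_CTYPE for the child and expect UTF-8 output.
    """
    if preferred is None:
        preferred = locale.getpreferredencoding()
    env = dict(base_env or {})
    if codecs.lookup(preferred).name == "ascii":
        # Assumption: the C.UTF-8 locale exists on this system.
        env["LC_CTYPE"] = "C.UTF-8"
        return env, "utf-8"
    return env, preferred


def decode_build_output(raw, encoding):
    """Decode subprocess output without raising on undecodable bytes."""
    return raw.decode(encoding, errors="backslashreplace")
```

Using "backslashreplace" keeps the offending byte values visible in the log (for example `\xff`), whereas "replace" would collapse them all to U+FFFD.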
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20 May 2017 at 09:03, Thomas Kluyver wrote:
> On Sat, May 20, 2017, at 07:54 AM, Nick Coghlan wrote:
>> * on platforms with 8-bit standard streams (e.g. Linux, Mac OS X),
>> build systems SHOULD emit UTF-8 encoded output
>> * on platforms with 16-bit standard streams (e.g. Windows), build
>> systems SHOULD emit UTF-16-LE encoded output
>
> I'm quite prepared to accept that I'm mistaken, but my understanding is
> that *standard streams* are 8-bit on Windows as well. The 16-bit thing
> that Python 3.6 does, as I understand it, is to bypass standard streams
> when it detects that they're connected to a console, and use a Windows
> API call to write text to the console directly as UTF-16.
>
> If so, when stdout/stderr are pipes, which I assume is how pip captures
> the output from build processes, there's no particular reason to send
> UTF-16 data just because it's Windows.

That's my understanding too. The standard streams are still byte streams with an encoding. It's just that the underlying IO, when the final destination is the console, is done by the Windows Unicode APIs. Because of this, when the output is the console the stream can accept any unicode character and so an encoding of UTF-8 is specified (and yes, AIUI there is a translation Unicode string -> UTF-8 bytes -> Unicode console API). For non-console output, though, the standard streams are still byte streams and the platform behaviour is respected, so we use the ANSI codepage (calling this the platform standard glosses over the fact that there are two standard codepages, ANSI and OEM, and tools don't always make the same choice when faced with piped output).

Long story short, UTF-16 is irrelevant here.

The docs for 3.6 say "Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page". This is out of date (it was true for 3.5 and earlier). In 3.6+ utf-8 is used for interactive streams rather than the console codepage:

    >py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)"
    utf-8
    >py -c "import sys; print(sys.stdout.encoding, file=sys.stderr)" >$null
    cp1252

The bigger question, though, is to what extent we want to mandate that build tools that run external tools such as compilers take responsibility for the encoding of the output of those tools (rather than simply passing the output through to the output stream unmodified). And if we do want to, whether we want to allow an exception for setuptools/distutils.

Also, a question regarding Unix - do we really want to mandate UTF-8 even if the system locale is set to something else? Won't that mean that build tools have the same problem with compilers generating output in the encoding the tool wants that we already have on Windows?

My feeling is:

1. Build systems SHOULD emit output encoded in the preferred locale encoding (normally UTF-8 on Unix, ANSI on Windows).
2. Build systems should ideally check the encoding used by external tools that they run and transcode to the correct encoding if necessary - but this is a quality of implementation matter.
3. Install tools MUST NOT fail if build tools produce output with the wrong encoding, but MUST correctly reproduce build tool output if the build tools do produce the right encoding.

My biggest concern with this is that I believe that Visual C produces output in the OEM codepage even when output to a pipe. Actually I just did some experiments (VS 2015), and it's even worse than that. The compiler (cl) seems to use the OEM code page when writing to a pipe, but the linker uses the ANSI code page. This means that a command like "cl a£bc" produces output on (a piped) stdout that contains mixed encodings.

Given this situation, I think we have to simply give up and take the view that the Visual C tools are simply broken in this regard, and we shouldn't worry about them. So I'm inclined therefore to drop point (2) from the 3 above.

Paul
___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
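To illustrate why the mixed cl/linker output described above cannot be decoded cleanly, here is a small simulation. It assumes cp850 as the OEM code page and cp1252 as the ANSI code page (a common Western European Windows setup; other systems use different pairs), and the byte string is constructed by hand rather than captured from the real tools.

```python
# Simulated piped output: one line in the OEM code page, one in ANSI.
pound = "£"
oem_bytes = pound.encode("cp850")    # b'\x9c'
ansi_bytes = pound.encode("cp1252")  # b'\xa3'
mixed = b"cl: " + oem_bytes + b"\nlink: " + ansi_bytes + b"\n"

# Whichever single code page the reader picks, one line comes out wrong.
as_ansi = mixed.decode("cp1252")  # 'cl: œ\nlink: £\n'
as_oem = mixed.decode("cp850")    # 'cl: £\nlink: ú\n'

print(as_ansi)
print(as_oem)
```

Neither decode raises, which is the awkward part: the result is silent mojibake on one of the two lines rather than an error a tool could detect.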
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Sat, May 20, 2017, at 07:54 AM, Nick Coghlan wrote: > * on platforms with 8-bit standard streams (e.g. Linux, Mac OS X), > build systems SHOULD emit UTF-8 encoded output > * on platforms with 16-bit standard streams (e.g. Windows), build > systems SHOULD emit UTF-16-LE encoded output I'm quite prepared to accept that I'm mistaken, but my understanding is that *standard streams* are 8-bit on Windows as well. The 16-bit thing that Python 3.6 does, as I understand it, is to bypass standard streams when it detects that they're connected to a console, and use a Windows API call to write text to the console directly as UTF-16. If so, when stdout/stderr are pipes, which I assume is how pip captures the output from build processes, there's no particular reason to send UTF-16 data just because it's Windows. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20 May 2017 at 01:16, Thomas Kluyver wrote:
> On Fri, May 19, 2017, at 03:41 PM, Paul Moore wrote:
>> Can we specify what encoding the informational text must be written
>> in?
>
> Sure, that makes sense. What about:
>
> All hooks are run with working directory set to the root of the source
> tree, and MAY print arbitrary informational text on stdout and stderr.
> This text SHOULD be UTF-8 encoded, but as building may invoke other
> processes, install tools MUST NOT fail if the data they receive is not
> valid UTF-8; though in this case the display of the output may be
> corrupted.
>
> Do we also want to recommend that install tools set
> PYTHONIOENCODING=utf-8 when invoking build tools? Or leave this up to
> the build tools?

Setting PYTHONIOENCODING=utf-8:strict would potentially fail the "don't fail hard on misencoded output" requirement, and setting anything else is dubious from a potential data loss or compatibility point of view (as there's no "surrogateescape" error handler in Python 2).

For use cases like distro package building, we'd also like to inherit the surrounding build environment, so explicitly requiring installation tools to alter it at the Python level doesn't strike me as ideal.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
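The concern about PYTHONIOENCODING=utf-8:strict can be reproduced with a short Python 3 experiment. The child script below prints a lone surrogate, which is what Python 3 produces when it has decoded undecodable bytes (for example a file name) with "surrogateescape"; the helper name `run_child` is hypothetical.

```python
import os
import subprocess
import sys

# Child program: print a lone surrogate, which UTF-8 cannot encode.
CHILD = "print('\\udc80')"


def run_child(io_encoding):
    """Run the child with PYTHONIOENCODING set; return (exit code, stdout)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stdout


strict_code, _ = run_child("utf-8:strict")
lenient_code, lenient_out = run_child("utf-8:backslashreplace")
print(strict_code, lenient_code, lenient_out)
```

Under "strict" the child dies with a UnicodeEncodeError and a non-zero exit code; under "backslashreplace" it succeeds but the output contains an escape sequence instead of the original data, which is the data-loss trade-off mentioned above.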
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 20 May 2017 at 00:18, Thomas Kluyver wrote: > Hi, > > I'd like to make another push for PEP 517, which would make it possible > to build wheels from a source tree with other build tools, without > needing setup.py. > > https://www.python.org/dev/peps/pep-0517/ > > Last time this was discussed we made a couple of minor changes to the > PEP, but we didn't want to accept another packaging related PEP until > PEP 518 was implemented in pip. I'm pleased to say that that > implementation has just been merged: > https://github.com/pypa/pip/pull/4144 . Huzzah, and congratulations! :) Regarding the encoding question, I agree with your recommendation with one key amendment to account for the 16-bit console APIs on Windows: * on platforms with 8-bit standard streams (e.g. Linux, Mac OS X), build systems SHOULD emit UTF-8 encoded output * on platforms with 16-bit standard streams (e.g. Windows), build systems SHOULD emit UTF-16-LE encoded output * on platforms that offer both, build systems SHOULD use the 16-bit streams to match the default behaviour of CPython 3.6+ * install tools MUST NOT fail the build solely due to improperly encoded output, but are otherwise free to handle the situation as they see fit Folks on Python 3.5 and earlier on Windows may still have problems given that guidance (since that uses the 8-bit stream interfaces with the Windows native encodings by default), but that's also a large part of why CPython's behaviour on Windows was changed in 3.6 :) Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Fri, May 19, 2017, at 05:17 PM, Paul Moore wrote: > On 19 May 2017 at 16:53, Daniel Holth wrote: > > Congrats on getting 518 in. > > Agreed, by the way. That's a big step! Thanks both. It does feel like an achievement. :-) ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 19 May 2017 at 16:53, Daniel Holth wrote: > Congrats on getting 518 in. Agreed, by the way. That's a big step! Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
Congrats on getting 518 in. On Fri, May 19, 2017, 11:37 Thomas Kluyver wrote: > On Fri, May 19, 2017, at 04:31 PM, Paul Moore wrote: > > For flit, would having the install tool set PYTHONIOENCODING help? > > If install tools were meant to set PYTHONIOENCODING, I probably wouldn't > do anything else in flit's code. Python should then take care of > ensuring that any output is UTF-8 encoded, and flit doesn't currently > invoke any separate processes to do the build. > > Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Fri, May 19, 2017, at 04:31 PM, Paul Moore wrote: > For flit, would having the install tool set PYTHONIOENCODING help? If install tools were meant to set PYTHONIOENCODING, I probably wouldn't do anything else in flit's code. Python should then take care of ensuring that any output is UTF-8 encoded, and flit doesn't currently invoke any separate processes to do the build. Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 19 May 2017 at 16:16, Thomas Kluyver wrote: > On Fri, May 19, 2017, at 03:41 PM, Paul Moore wrote: >> Can we specify what encoding the informational text must be written >> in? > > Sure, that makes sense. What about: > > All hooks are run with working directory set to the root of the source > tree, and MAY print arbitrary informational text on stdout and stderr. > This text SHOULD be UTF-8 encoded, but as building may invoke other > processes, install tools MUST NOT fail if the data they receive is not > valid UTF-8; though in this case the display of the output may be > corrupted. Looks good, although whether UTF-8 is viable on Windows is something I'll have to think about. > Do we also want to recommend that install tools set > PYTHONIOENCODING=utf-8 when invoking build tools? Or leave this up to > the build tools? Good question. At the moment, the only 2 cases I know of are setuptools/distutils and flit. For setuptools, I'm pretty sure there's no handling of subprocesses, it just fires them off and lets them write to the console - so there's nothing to even ensure a consistent encoding. We may have to allow for special casing with setuptools, as I doubt anyone's going to put in the effort to add a transcoding layer in there. For flit, would having the install tool set PYTHONIOENCODING help? I don't know immediately what I'd do if I were designing a brand new build tool that called out to a 3rd party compiler. Let me think about it. Paul ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On Fri, May 19, 2017, at 03:41 PM, Paul Moore wrote: > Can we specify what encoding the informational text must be written > in? Sure, that makes sense. What about: All hooks are run with working directory set to the root of the source tree, and MAY print arbitrary informational text on stdout and stderr. This text SHOULD be UTF-8 encoded, but as building may invoke other processes, install tools MUST NOT fail if the data they receive is not valid UTF-8; though in this case the display of the output may be corrupted. Do we also want to recommend that install tools set PYTHONIOENCODING=utf-8 when invoking build tools? Or leave this up to the build tools? Thomas ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
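On the install-tool side, the "MUST NOT fail if the data they receive is not valid UTF-8" wording proposed above amounts to something like the following sketch. The function name `decode_hook_output` is hypothetical, and the choice of "replace" as the error handler is one option among several.

```python
def decode_hook_output(raw):
    """Decode captured hook output for display without ever raising.

    Invalid byte sequences are shown as U+FFFD, so the display may be
    corrupted but the install tool itself keeps going.
    """
    return raw.decode("utf-8", errors="replace")


valid = decode_hook_output("building naïve-pkg".encode("utf-8"))
# Latin-1 encoded bytes are not valid UTF-8; the bad byte is replaced.
corrupted = decode_hook_output("café".encode("latin-1"))

print(valid)
print(corrupted)
```

A strict decode of the second input would raise UnicodeDecodeError, which is exactly the failure mode the proposed text rules out.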
Re: [Distutils] PEP 517 - specifying build system in pyproject.toml
On 19 May 2017 at 15:18, Thomas Kluyver wrote: > Hi, > > I'd like to make another push for PEP 517, which would make it possible > to build wheels from a source tree with other build tools, without > needing setup.py. A point that came up recently while dealing with a pip issue. """ All hooks are run with working directory set to the root of the source tree, and MAY print arbitrary informational text on stdout and stderr. """ Can we specify what encoding the informational text must be written in? At the moment pip has problems dealing with non-ASCII locales because it captures the build output and then displays it on error. This involves a decode/encode step (on Python 3) or printing arbitrary bytes to stdout (on Python 2). And at the moment we get UnicodeErrors if there's a mismatch. I've patched it to use errors=replace, but we still risk mojibake. Ideally, we should specify an encoding that hooks will use for output - but that's somewhat difficult as many build tools will want to do things like run compilers which could do arbitrarily silly things. I believe this is less of a problem on Unix (where there's a well-managed convention), but on Windows there's an "OEM" codepage for console programs, and an "ANSI" codepage for Windows programs - but not all programs use the same one - some console programs such as Mingw, I think, and Python itself if stdout is redirected (see https://docs.python.org/3.6/library/sys.html#sys.stdout) use the ANSI codepage. So we may have to fudge the situation a bit. (Maybe something like "Install tools MAY assume a specific encoding for the output, and MAY produce corrupted output if the build tool does not use that encoding, but install tools MUST NOT fail with an encoding error just because the encodings don't match"). But I don't think we should leave the situation completely unspecified. Paul. ___ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig