Re: [Python-Dev] PEP 3333: wsgi_string() function
On Sun, Jan 9, 2011 at 1:47 AM, Stephen J. Turnbull step...@xemacs.org wrote:

> Robert Brewer writes:
>
> > Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.
>
> You have one. It's called ISO 2022: Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques. The popularity of that standard speaks for itself.

The kind of object PJE was referring to is more like Ruby's strings, which do not embed the encoding inside the bytes themselves but have the encoding as a kind of annotation on the bytes, and do lazy transcoding when combining strings of different encodings. The goal with respect to WSGI is that you could annotate bytes with an encoding but also change or fix that encoding if other out-of-band information implied that you got the encoding wrong (e.g., some data is submitted with the encoding of the page the browser was on, and so nothing inside the request itself will indicate the encoding of the data).

Latin1 is kind of the poor man's version of this -- it's a good guess at an encoding, that at worst requires transcoding that can be done in a predictable way. (Personally I think Latin1 gets us 99% of the way there, and so bytes-of-a-known-encoding are not really that important to the WSGI case.)

Ian

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 3333: wsgi_string() function
On 10/01/2011 17:24, Ian Bicking wrote:

> On Sun, Jan 9, 2011 at 1:47 AM, Stephen J. Turnbull step...@xemacs.org wrote:
>
> > Robert Brewer writes:
> >
> > > Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.
> >
> > You have one. It's called ISO 2022: Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques. The popularity of that standard speaks for itself.
>
> The kind of object PJE was referring to is more like Ruby's strings, which do not embed the encoding inside the bytes themselves but have the encoding as a kind of annotation on the bytes, and do lazy transcoding when combining strings of different encodings. The goal with respect to WSGI is that you could annotate bytes with an encoding but also change or fix that encoding if other out-of-band information implied that you got the encoding wrong (e.g., some data is submitted with the encoding of the page the browser was on, and so nothing inside the request itself will indicate the encoding of the data).
>
> Latin1 is kind of the poor man's version of this -- it's a good guess at an encoding, that at worst requires transcoding that can be done in a predictable way. (Personally I think Latin1 gets us 99% of the way there, and so bytes-of-a-known-encoding are not really that important to the WSGI case.)

I think the language moratorium was not the only objection to the inclusion of a third string type in Python (the screwed string - safe to treat neither as bytes nor as text). I recall objections in principle too from core developers during the EuroPython language summit.

Michael

--
http://www.voidspace.org.uk/
May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, Jan 11, 2011 at 3:24 AM, Ian Bicking i...@colorstudy.com wrote:

> The kind of object PJE was referring to is more like Ruby's strings, which do not embed the encoding inside the bytes themselves but have the encoding as a kind of annotation on the bytes, and do lazy transcoding when combining strings of different encodings. The goal with respect to WSGI is that you could annotate bytes with an encoding but also change or fix that encoding if other out-of-band information implied that you got the encoding wrong (e.g., some data is submitted with the encoding of the page the browser was on, and so nothing inside the request itself will indicate the encoding of the data).
>
> Latin1 is kind of the poor man's version of this -- it's a good guess at an encoding, that at worst requires transcoding that can be done in a predictable way. (Personally I think Latin1 gets us 99% of the way there, and so bytes-of-a-known-encoding are not really that important to the WSGI case.)

Having done the upgrade to urllib to support direct manipulation of byte sequences, I don't think such a type would help as much as people hoped anyway. Converting to Unicode, manipulating as text and converting back really *is* the right way to do text manipulation (however, providing bytes-in-bytes-out APIs that do the conversions for you can also be quite convenient).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
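To illustrate the bytes-in-bytes-out pattern mentioned above, here is a minimal sketch. The helper name `upper_path` is hypothetical, and the strict ASCII decode is an assumption modelled on the 7-bit approach described for urllib.parse in this thread; it is not a copy of urllib's own implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def upper_path(url):
    """Hypothetical helper: uppercase the path of a URL given as str or bytes."""
    is_bytes = isinstance(url, (bytes, bytearray))
    # Decode strictly as ASCII so that non-ASCII bytes fail loudly
    # instead of being silently guessed at.
    text = url.decode("ascii") if is_bytes else url
    parts = urlsplit(text)
    result = urlunsplit(parts._replace(path=parts.path.upper()))
    # Hand back the same type the caller passed in.
    return result.encode("ascii") if is_bytes else result
```

The caller never sees the intermediate text: bytes go in and bytes come out, while the manipulation itself is done on str.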
Re: [Python-Dev] PEP 3333: wsgi_string() function
Ian Bicking writes:

> On Sun, Jan 9, 2011 at 1:47 AM, Stephen J. Turnbull step...@xemacs.org wrote:
>
> > Robert Brewer writes:
> >
> > > Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.
> >
> > You have one. It's called ISO 2022: Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques. The popularity of that standard speaks for itself.
>
> The kind of object PJE was referring to is more like Ruby's strings,

Notice that Ruby was written by a Japanese, the same culture that brought us Mule, TRON, X Compound Text, and ISO-2022 in the first place. Matsumoto himself probably isn't infected with the "Unicode is going to be the death of all Japanese culture" bug, but that's the attitude that is behind ISO 2022.

> which do not embed the encoding inside the bytes themselves but have the encoding as a kind of annotation on the bytes,

My point is that ISO-2022 is basically just a serialization of that. And it sucks; nobody uses it, except in Japanese and Korean email. Maybe Mandarin (but Taiwan and Hong Kong use Big5 or EUC, not an escape-extended representation).

> and do lazy transcoding when combining strings of different encodings.

Which buys WSGI nothing, AIUI, since the people who want this claim that translating to Unicode either correctly or as big bytes (ie, zero-extension) is inefficient. They're shoveling bits; much of the time, by the time the out-of-band information catches up, it's going to be too late.

> The goal with respect to WSGI is that you could annotate bytes with an encoding but also change or fix that encoding if other out-of-band information implied that you got the encoding wrong (e.g., some data is submitted with the encoding of the page the browser was on, and so nothing inside the request itself will indicate the encoding of the data).

A noble goal, but nobody's gonna bell that cat.

This is all just wishful thinking. 2 decades of experience with Emacs/Mule and similar efforts show that if you provide this facility, people will use it, and that use will include a lot of abuse (ie, throwing the garbage into somebody else's backyard, rather than disposing of it yourself) -- in the end, the garbage gets piled high enough that it's not worth the effort to try to make it work.

> Latin1 is kind of the poor man's version of this -- it's a good guess at an encoding, that at worst requires transcoding that can be done in a predictable way. (Personally I think Latin1 gets us 99% of the way there, and so bytes-of-a-known-encoding are not really that important to the WSGI case.)

In particular, it gets PJE 100% of the way there, since he proposes always targeting ISO 8859/1, anyway. And if it's not useful to WSGI, who is it useful to?
Re: [Python-Dev] PEP 3333: wsgi_string() function
Robert Brewer writes:

> Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.

You have one. It's called ISO 2022: Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques. The popularity of that standard speaks for itself.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:

> What is this horrible encoding bytes-as-unicode?

It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1 is the encoding specified by the HTTP RFC, as well as having the happy property of preserving every input byte.

> os.environ is supposed to be correctly decoded and contain valid unicode characters.

Nope. It is not possible to ‘correctly’ decode to unicode for os.environ because that decoding happens long before the web application gets a look in. Maybe the web application is using UTF-8, maybe it's using cp1252, but if we let the server/gateway decide and do that decoding before the application can do anything about it, we will get the wrong encoding in *many* cases and the result will be permanent, unrecoverable mangling of non-ASCII characters in submitted headers.

> If WSGI uses another encoding than the locale encoding (which is a bad idea),

It's an absolutely necessary idea. The locale encoding is nothing to do with the web application's encoding. Windows applications need to be able to use UTF-8 (which is never the ANSI code page), and web applications in general need to be deployable to any server without having to worry about the server's locale. The locale-dependent status quo is that non-ASCII characters in URL paths and other HTTP headers don't work for Python apps.

The recoding dances present in wsgiref's CGIHandler for 3.2 are distasteful but completely necessary to normalise differences in encodings used by various servers and platforms to generate their CGI environment.

> it should use os.environb and decodes keys and values using its own encoding.

Well yes, but:

(a) os.environb doesn't exist in Python 3.1 and earlier, making it impossible to implement WSGI before 3.2;

(b) there are also non-HTTP-related environment variables, which may contain native Unicode strings (eg, very commonly, Windows pathnames), so you have to have both environ *and* environb.

The bytes-or-bytes-in-Unicode argument is something that has been bounced around Web-SIG for literally *years*; this is what we ended up with. Although I personally like bytes, frankly, a re-run of this argument *again* whilst WSGI remains in perpetual stalemate does not appeal. WSGI and wsgiref in Python 3.0-3.1 simply do not work at all. This has been an embarrassing situation for what is supposed to be a leading web language. Let's not perpetuate this sorry story to 3.2 as well.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com
skype:uknrbobince gtalk:chat?jid=bobi...@gmail.com
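The "preserving every input byte" property claimed above can be checked with a short sketch. The sample path and the choice of ASCII as the server's wrong guess are illustrative assumptions, not taken from any real server.

```python
# Every one of the 256 byte values survives an ISO-8859-1 round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes

# Assumed example: the application uses UTF-8, so these are the raw header bytes.
raw = "/café".encode("utf-8")

# A server that guesses the wrong encoding and replaces errors destroys data:
# the original bytes can no longer be recovered from this string.
lossy = raw.decode("ascii", errors="replace")

# Decoding as ISO-8859-1 gives odd-looking text, but nothing is lost.
preserved = raw.decode("iso-8859-1")

# The application, which knows its own encoding, can undo the disguise later.
recovered = preserved.encode("iso-8859-1").decode("utf-8")
```

The value in `preserved` is not meaningful text; it is only a lossless carrier for the bytes until the application re-decodes them.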
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Thursday, 6 January 2011 at 23:50, And Clover wrote:

> On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
>
> > What is this horrible encoding bytes-as-unicode?
>
> It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1 is the encoding specified by the HTTP RFC, as well as having the happy property of preserving every input byte. PEP requires it.

ISO-8859-1 for all fields: SERVER_NAME, PATH_INFO, the URL, form data, ...?

> > os.environ is supposed to be correctly decoded and contain valid unicode characters.
>
> It is not possible to ‘correctly’ decode to unicode for os.environ because that decoding happens long before the web application (the only party that knows what encoding should be in use) gets a look in.

Agreed.

> Maybe the web application is using UTF-8, maybe it's using cp1252, but if we let the server/gateway decide and do that decoding (...) It's an absolutely necessary idea. The locale encoding is nothing to do with the web application's encoding. (...)

Ok, so you must pass byte strings to the server/gateway. If you pass unicode, how does the server/gateway know that it has to redecode a value? Should it redecode all values? Anyway, it is stupid to use a temporary useless pseudo-encoding (bytes-in-unicode).

> The recoding dances present in wsgiref's CGIHandler for 3.2 are distasteful but completely necessary to normalise differences in encodings used by various servers and platforms to generate their CGI environment.

I don't understand why read_environ() gives unicode values: as you explained, the server/gateway will have to encode the values again, and then finally to decode them from the correct encoding.

On POSIX, the current code looks like this:

a) the OS passes a bytes environ to the program
b) Python decodes environ from the locale encoding
c) wsgi.read_environ() encodes environ to the locale encoding to get back the original bytes environ: this step can be skipped if os.environb is available
d) wsgi.read_environ() decodes environ from ISO-8859-1
e) the server/gateway encodes environ to ISO-8859-1
f) the server/gateway decodes environ from the right encoding

Hey! Don't you think that there are useless encode/decode steps here? Especially (d)-(e) is useless and introduces a confusion: the environ contains other keys that don't come from os.environ and are already correctly decoded, so how does the server/gateway know that they are already correctly decoded?

I propose simply (for Python 3.2):

a) the OS passes a bytes environ to the program: wsgi.read_environ() uses it
b) the server/gateway decodes environ from the right encoding

and...

> (a) os.environb doesn't exist in Python 3.1 and earlier, making it impossible to implement WSGI before 3.2;

For Python 3.1, add a step between (a) and (b): encode environ to the locale encoding (with surrogateescape) to get back the original bytes environ.

> (b) a byte environment on Windows would have to be encoded from the Unicode environment, with a server-specific encoding, and then what encoding are you going to choose for the variables that contain non-HTTP-sourced native Unicode strings (such as, very commonly, Windows pathnames)?

The variables coming from the HTTP server should be encoded again to the server-specific encoding. Other variables should be kept unchanged. The server/gateway can simply test the type of the variable: if it's unicode, there is nothing to do; if it's bytes, decode it from the correct encoding.

> The bytes-or-bytes-in-Unicode argument is something that has been bounced around Web-SIG for literally *years*; (...) WSGI and wsgiref in Python 3.0-3.1 simply do not work.

I don't understand why you are attached to this horrible hack (bytes-in-unicode). It introduces more work and more confusion than using raw bytes unchanged. It doesn't work and so something has to be changed.

Victor
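The type test suggested in the message above (leave unicode values alone, decode bytes values) could look roughly like the following sketch. It illustrates that proposal only, not what PEP 3333 adopted; the function name `decode_environ` and the UTF-8 default are assumptions made for the example.

```python
def decode_environ(environ, encoding="utf-8"):
    """Hypothetical helper: decode bytes values, pass str values through."""
    decoded = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            # Value came from the HTTP side as raw bytes: decode it once,
            # using the encoding the application has chosen.
            value = value.decode(encoding)
        # str values (for example native pathnames) are kept unchanged.
        decoded[key] = value
    return decoded


mixed = {"PATH_INFO": b"/caf\xc3\xa9", "wsgi.url_scheme": "http"}
result = decode_environ(mixed)
```

Under this scheme the type of each value carries the information about whether decoding is still pending, which is exactly what a str-only environ cannot express.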
Re: [Python-Dev] PEP 3333: wsgi_string() function
Victor Stinner writes:

> It doesn't work and so something has to be changed.

What specific bug have you observed?

Everybody hates this hack, or at the very least is somewhat embarrassed by it, but the working group clearly believes that it works and something like it is necessary. They've studied it for years. To get rid of it, somebody needs to demonstrate a bug, and propose something better, plus implement it in code, plus fix any tests that expect Unicode and now get bytes, plus create any additional tests that may be necessitated by changing from a Unicode representation to a bytes representation.

I hate it too, but not enough to ask anybody to do any of the above without a real bug.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Fri, Jan 7, 2011 at 9:51 PM, Victor Stinner victor.stin...@haypocalc.com wrote:

> On POSIX, the current code looks like this:
>
> a) the OS passes a bytes environ to the program
> b) Python decodes environ from the locale encoding
> c) wsgi.read_environ() encodes environ to the locale encoding to get back the original bytes environ: this step can be skipped if os.environb is available
> d) wsgi.read_environ() decodes environ from ISO-8859-1
> e) the server/gateway encodes environ to ISO-8859-1
> f) the server/gateway decodes environ from the right encoding
>
> Hey! Don't you think that there are useless encode/decode steps here? Especially (d)-(e) is useless and introduces a confusion: the environ contains other keys that don't come from os.environ and are already correctly decoded, so how does the server/gateway know that they are already correctly decoded?

Because WSGI is platform neutral. WSGI apps have no idea if they're running on Windows or POSIX. The type used to communicate between the WSGI engine and the WSGI app must be either bytes *or* unicode, and either choice causes problems depending on the underlying OS.

bytes-as-unicode is not a great choice, it is merely the least bad choice of the available options.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
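The (a)-(f) sequence quoted above can be traced on a single value. This is only a sketch under stated assumptions: the locale encoding is taken to be UTF-8, the application's real encoding is taken to be cp1252, and the sample bytes are invented for the example.

```python
# Assumptions for illustration: UTF-8 locale, cp1252 application.
LOCALE_ENCODING = "utf-8"
APP_ENCODING = "cp1252"

# (a) Bytes handed over by the OS; 0x80 is the euro sign in cp1252
#     and is not valid UTF-8 on its own.
raw = b"/price\x80"

# (b) Python decodes the environment from the locale encoding; with
#     surrogateescape the undecodable byte becomes a lone surrogate.
py_str = raw.decode(LOCALE_ENCODING, "surrogateescape")

# (c) Re-encoding with the same handler restores the original bytes.
recovered = py_str.encode(LOCALE_ENCODING, "surrogateescape")

# (d) The bytes are decoded as ISO-8859-1 (the bytes-as-unicode form).
wsgi_str = recovered.decode("iso-8859-1")

# (e) + (f) The consumer encodes back to ISO-8859-1, then decodes
#     with the encoding the application actually uses.
final = wsgi_str.encode("iso-8859-1").decode(APP_ENCODING)

# The shorter route proposed in the quoted message: decode the bytes once.
direct = raw.decode(APP_ENCODING)
```

Both routes end at the same text, so the disagreement in the thread is about which intermediate type crosses the server/application boundary, not about the final result.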
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Jan 7, 2011, at 6:51 AM, Victor Stinner wrote:

> I don't understand why you are attached to this horrible hack (bytes-in-unicode). It introduces more work and more confusion than using raw bytes unchanged. It doesn't work and so something has to be changed.

It's gross but it does work. This has been discussed ad nauseam on web-sig over a period of years.

I'd like to reiterate that it is only even a potential issue for the PATH_INFO/SCRIPT_NAME keys. Those two keys are required to have been urldecoded already, into byte-data in some encoding. For all the other keys (including the ones from os.environ), they are either *properly* decoded in 8859-1 or are just ascii (possibly still urlencoded, so the app needs to urldecode and decode into a string with the correct encoding).

James
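As a rough sketch of the distinction drawn above, an application might treat the already-urldecoded PATH_INFO differently from the still-urlencoded QUERY_STRING. The helper names and the UTF-8 default are assumptions for the example; only `urllib.parse.parse_qs` and its `encoding` parameter are standard library features.

```python
from urllib.parse import parse_qs


def path_info(environ, encoding="utf-8"):
    """Hypothetical helper: recover the real text of PATH_INFO."""
    # PATH_INFO is already urldecoded, so it holds arbitrary bytes
    # disguised as ISO-8859-1 text; undo that, then decode properly.
    return environ.get("PATH_INFO", "").encode("iso-8859-1").decode(encoding)


def query_params(environ, encoding="utf-8"):
    """Hypothetical helper: urldecode QUERY_STRING with the app's encoding."""
    # QUERY_STRING is still percent-encoded ASCII, so no latin-1 detour
    # is needed; parse_qs unquotes and decodes in one step.
    return parse_qs(environ.get("QUERY_STRING", ""), encoding=encoding)


environ = {"PATH_INFO": "/caf\u00c3\u00a9", "QUERY_STRING": "q=caf%C3%A9"}
```

With this environ, both helpers arrive at the same word even though the two keys reach the application in different forms.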
Re: [Python-Dev] PEP 3333: wsgi_string() function
At 09:43 AM 1/7/2011 -0500, James Y Knight wrote:

> On Jan 7, 2011, at 6:51 AM, Victor Stinner wrote:
>
> > I don't understand why you are attached to this horrible hack (bytes-in-unicode). It introduces more work and more confusion than using raw bytes unchanged. It doesn't work and so something has to be changed.
>
> It's gross but it does work. This has been discussed ad nauseam on web-sig over a period of years.
>
> I'd like to reiterate that it is only even a potential issue for the PATH_INFO/SCRIPT_NAME keys. Those two keys are required to have been urldecoded already, into byte-data in some encoding. For all the other keys (including the ones from os.environ), they are either *properly* decoded in 8859-1 or are just ascii (possibly still urlencoded, so the app needs to urldecode and decode into a string with the correct encoding).

Right. Also, it should be mentioned that none of this would be necessary if we could've gotten a bytes of a known encoding type. If you look back to the last big Python-Dev discussion on bytes/unicode and stdlib API breakage, this was the holdup for getting a sane WSGI spec. Since we couldn't change the language to fix the problem (due to the moratorium), we had to use this less-pleasant way of dealing with things, in order to get a final WSGI spec for Python 3.

(If anybody is wondering about the specifics of the language change that was needed, it'd be having a bytes with known encoding type, that when combined in any polymorphic operation with a unicode string, would result in bytes-with-encoding output, and would raise an error if the resulting value could not be encoded in the target encoding. Then we would simply do all WSGI header operations with this type, using latin-1 as the target encoding.)
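The behaviour described in the parenthetical above can be sketched in pure Python. The class name `EncodedBytes` is hypothetical and no such type exists in the language or standard library; only concatenation is modelled here, whereas the type as described would cover every polymorphic operation.

```python
class EncodedBytes:
    """Hypothetical 'bytes of a known encoding' value (not a real Python type)."""

    def __init__(self, data, encoding="iso-8859-1"):
        self.data = bytes(data)
        self.encoding = encoding

    def __add__(self, other):
        if isinstance(other, EncodedBytes):
            # Go through text so that differing encodings are transcoded.
            other = other.data.decode(other.encoding)
        if not isinstance(other, str):
            return NotImplemented
        # str.encode raises UnicodeEncodeError if the text does not fit
        # the target encoding, which is the error behaviour described.
        return EncodedBytes(self.data + other.encode(self.encoding), self.encoding)

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return EncodedBytes(other.encode(self.encoding) + self.data, self.encoding)

    def __str__(self):
        return self.data.decode(self.encoding)


header = EncodedBytes(b"text/html; charset=") + "utf-8"
prefixed = "X-" + EncodedBytes(b"Powered-By")
```

Mixing with str always yields the bytes-with-encoding type, so a header value stays tied to latin-1 no matter which operand came first.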
Re: [Python-Dev] PEP 3333: wsgi_string() function
P.J. Eby wrote:

> At 09:43 AM 1/7/2011 -0500, James Y Knight wrote:
>
> > On Jan 7, 2011, at 6:51 AM, Victor Stinner wrote:
> >
> > > I don't understand why you are attached to this horrible hack (bytes-in-unicode). It introduces more work and more confusion than using raw bytes unchanged. It doesn't work and so something has to be changed.
> >
> > It's gross but it does work. This has been discussed ad nauseam on web-sig over a period of years.
> >
> > I'd like to reiterate that it is only even a potential issue for the PATH_INFO/SCRIPT_NAME keys. Those two keys are required to have been urldecoded already, into byte-data in some encoding. For all the other keys (including the ones from os.environ), they are either *properly* decoded in 8859-1 or are just ascii (possibly still urlencoded, so the app needs to urldecode and decode into a string with the correct encoding).
>
> Right. Also, it should be mentioned that none of this would be necessary if we could've gotten a bytes of a known encoding type. If you look back to the last big Python-Dev discussion on bytes/unicode and stdlib API breakage, this was the holdup for getting a sane WSGI spec. Since we couldn't change the language to fix the problem (due to the moratorium), we had to use this less-pleasant way of dealing with things, in order to get a final WSGI spec for Python 3.
>
> (If anybody is wondering about the specifics of the language change that was needed, it'd be having a bytes with known encoding type, that when combined in any polymorphic operation with a unicode string, would result in bytes-with-encoding output, and would raise an error if the resulting value could not be encoded in the target encoding. Then we would simply do all WSGI header operations with this type, using latin-1 as the target encoding.)

Still looking forward to the day when that moratorium is lifted. Anyone have any idea when that will be?

Bob
Re: [Python-Dev] PEP 3333: wsgi_string() function
On 7 January 2011 18:36, Robert Brewer fuman...@aminus.org wrote:

> Still looking forward to the day when that moratorium is lifted. Anyone have any idea when that will be?

See PEP 3003 (http://www.python.org/dev/peps/pep-3003/) - Python 3.3 is expected to be post-moratorium.

Paul.
Re: [Python-Dev] PEP 3333: wsgi_string() function
P.J. Eby p...@telecommunity.com wrote:

> Right. Also, it should be mentioned that none of this would be necessary if we could've gotten a bytes of a known encoding type.

Indeed! Or even string using a known encoding...

> If you look back to the last big Python-Dev discussion on bytes/unicode and stdlib API breakage, this was the holdup for getting a sane WSGI spec.

Yep.

Bill
Re: [Python-Dev] PEP 3333: wsgi_string() function
Paul Moore wrote:

> Robert Brewer fuman...@aminus.org wrote:
>
> > P.J. Eby wrote:
> >
> > > Also, it should be mentioned that none of this would be necessary if we could've gotten a bytes of a known encoding type.
> >
> > Still looking forward to the day when that moratorium is lifted. Anyone have any idea when that will be?
>
> See PEP 3003 (http://www.python.org/dev/peps/pep-3003/) - Python 3.3 is expected to be post-moratorium.

"This PEP proposes a temporary moratorium (suspension) of all changes to the Python language syntax, semantics, and built-ins for a period of at least two years from the release of Python 3.1."

Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.

Robert Brewer
fuman...@aminus.org
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Sat, Jan 8, 2011 at 6:16 AM, Robert Brewer fuman...@aminus.org wrote:

> Python 3.1 was released June 27th, 2009. We're coming up faster on the two-year period than we seem to be on a revised WSGI spec. Maybe we should shoot for a bytes of a known encoding type first.

There were a few minor* practical issues in getting agreement on how such a type would actually behave. Instead, the approach WSGI adopted (or the stricter, 7-bit ASCII only approach used internally by urllib.parse to handle bytes in 3.2) was deemed sufficient, since it could be done right now without having to agree on how many different bikesheds were needed and what colours they should all be.

Cheers,
Nick.

*i.e. major :)

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:

> What is this horrible encoding bytes-as-unicode?

It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1 is the encoding specified by the HTTP RFC, as well as having the happy property of preserving every input byte. PEP requires it.

> os.environ is supposed to be correctly decoded and contain valid unicode characters.

It is not possible to ‘correctly’ decode to unicode for os.environ because that decoding happens long before the web application (the only party that knows what encoding should be in use) gets a look in. Maybe the web application is using UTF-8, maybe it's using cp1252, but if we let the server/gateway decide and do that decoding before the application can do anything about it, we will get the wrong encoding in *many* cases and the result will be permanent, unrecoverable mangling of non-ASCII characters in submitted headers.

> If WSGI uses another encoding than the locale encoding (which is a bad idea),

It's an absolutely necessary idea. The locale encoding is nothing to do with the web application's encoding. Windows applications need to be able to use UTF-8 (which is never the ANSI code page), and web applications in general need to be deployable to any server without having to worry about the server's locale. The locale-dependent status quo is that non-ASCII characters in URL paths and other HTTP headers don't work for Python apps.

The recoding dances present in wsgiref's CGIHandler for 3.2 are distasteful but completely necessary to normalise differences in encodings used by various servers and platforms to generate their CGI environment.

> it should use os.environb and decodes keys and values using its own encoding.

Well yes, but:

(a) os.environb doesn't exist in Python 3.1 and earlier, making it impossible to implement WSGI before 3.2;

(b) a byte environment on Windows would have to be encoded from the Unicode environment, with a server-specific encoding, and then what encoding are you going to choose for the variables that contain non-HTTP-sourced native Unicode strings (such as, very commonly, Windows pathnames)?

The bytes-or-bytes-in-Unicode argument is something that has been bounced around Web-SIG for literally *years*; this is what we ended up with. Although I personally like bytes, frankly, a re-run of this argument *again* whilst WSGI remains in perpetual stalemate does not appeal. WSGI and wsgiref in Python 3.0-3.1 simply do not work. This has long been an embarrassing situation for what is supposed to be a leading web language. Let us not perpetuate this sorry story to 3.2 as well.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com
skype:uknrbobince gtalk:chat?jid=bobi...@gmail.com
Re: [Python-Dev] PEP 3333: wsgi_string() function
Can you please take a look at http://docs.python.org/dev/whatsnew/3.2.html#pep--python-web-server-gateway-interface-v1-0-1 to see if it accurately recaps the resolution of the WSGI text/bytes issues?

I would appreciate any feedback, as it is likely that the whatsnew document will be most people's first chance to hear the outcome of the multi-year discussion.

Thanks,
Raymond

On Jan 6, 2011, at 3:50 PM, And Clover wrote:

> On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
>
> > What is this horrible encoding bytes-as-unicode?
>
> It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1 is the encoding specified by the HTTP RFC, as well as having the happy property of preserving every input byte. PEP requires it.
>
> > os.environ is supposed to be correctly decoded and contain valid unicode characters.
>
> It is not possible to ‘correctly’ decode to unicode for os.environ because that decoding happens long before the web application (the only party that knows what encoding should be in use) gets a look in. Maybe the web application is using UTF-8, maybe it's using cp1252, but if we let the server/gateway decide and do that decoding before the application can do anything about it, we will get the wrong encoding in *many* cases and the result will be permanent, unrecoverable mangling of non-ASCII characters in submitted headers.
>
> > If WSGI uses another encoding than the locale encoding (which is a bad idea),
>
> It's an absolutely necessary idea. The locale encoding is nothing to do with the web application's encoding. Windows applications need to be able to use UTF-8 (which is never the ANSI code page), and web applications in general need to be deployable to any server without having to worry about the server's locale. The locale-dependent status quo is that non-ASCII characters in URL paths and other HTTP headers don't work for Python apps.
>
> The recoding dances present in wsgiref's CGIHandler for 3.2 are distasteful but completely necessary to normalise differences in encodings used by various servers and platforms to generate their CGI environment.
>
> > it should use os.environb and decodes keys and values using its own encoding.
>
> Well yes, but:
>
> (a) os.environb doesn't exist in Python 3.1 and earlier, making it impossible to implement WSGI before 3.2;
>
> (b) a byte environment on Windows would have to be encoded from the Unicode environment, with a server-specific encoding, and then what encoding are you going to choose for the variables that contain non-HTTP-sourced native Unicode strings (such as, very commonly, Windows pathnames)?
>
> The bytes-or-bytes-in-Unicode argument is something that has been bounced around Web-SIG for literally *years*; this is what we ended up with. Although I personally like bytes, frankly, a re-run of this argument *again* whilst WSGI remains in perpetual stalemate does not appeal. WSGI and wsgiref in Python 3.0-3.1 simply do not work. This has long been an embarrassing situation for what is supposed to be a leading web language. Let us not perpetuate this sorry story to 3.2 as well.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On 1/6/2011 3:50 PM, And Clover wrote:
> ISO-8859-1 is the encoding specified by the HTTP RFC

Please could I have the reference to that specification? I only recall ASCII and UTF-8 in my readings of various things HTTP and HTML, for headers, and form data. Naturally data pages can have any encoding they please, as there are headers and meta tags to describe their encodings.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Jan 6, 2011, at 8:16 PM, Glenn Linderman wrote:
> On 1/6/2011 3:50 PM, And Clover wrote:
> > ISO-8859-1 is the encoding specified by the HTTP RFC
>
> Please could I have the reference to that specification? I only recall ASCII and UTF-8 in my readings of various things HTTP and HTML, for headers, and form data. Naturally data pages can have any encoding they please, as there are headers and meta tags to describe their encodings.

Did you try Google? http://www.google.com/search?http+rfc

James
Re: [Python-Dev] PEP 3333: wsgi_string() function
Glenn Linderman writes:
> On 1/6/2011 3:50 PM, And Clover wrote:
> > ISO-8859-1 is the encoding specified by the HTTP RFC
>
> Please could I have the reference to that specification?

RFC 2616 (probably obsolete by now, but ISO 8859/1 is already there, IIRC), and I don't think UTF-8 is the default for anything until you get to XHTML (and maybe HTML5).
Re: [Python-Dev] PEP 3333: wsgi_string() function
On 1/6/2011 7:37 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
> > On 1/6/2011 3:50 PM, And Clover wrote:
> > > ISO-8859-1 is the encoding specified by the HTTP RFC
> >
> > Please could I have the reference to that specification?
>
> RFC 2616 (probably obsolete by now, but ISO 8859/1 is already there, IIRC), and I don't think UTF-8 is the default for anything until you get to XHTML (and maybe HTML5).

Thanks. Looking back, it is in RFCs 2068 and 1945 also. I just had a mental blind spot, thinking I understood the header formats from email-land, where they are more strictly required to be ASCII, as mentioned in my reply to James.

UTF-8 is the default for FORM DATA when using multipart/form-data encoding, using the POST method. Otherwise, FORM DATA is limited to ASCII. This is per http://www.w3.org/TR/html401/interact/forms.html#h-17.13.1 which is HTML 4.01 (and maybe earlier, but I didn't go back further).

It is nice to quote chapter and verse (or link) when declaring that something is in a standard.
Re: [Python-Dev] PEP 3333: wsgi_string() function
At 04:00 PM 1/6/2011 -0800, Raymond Hettinger wrote:
> Can you please take a look at http://docs.python.org/dev/whatsnew/3.2.html#pep--python-web-server-gateway-interface-v1-0-1 to see if it accurately recaps the resolution of the WSGI text/bytes issues.
>
> I would appreciate any feedback, as it is likely that the whatsnew document will be most people's first chance to hear the outcome of the multi-year discussion.

Hi Raymond -- nice work there. A few minor suggestions:

1. Native strings are used as the keys and values of the environ dictionary, not just as headers for start_response.

2. The read_environ() method is strictly for use with CGI-to-WSGI gateways, or for bridging other CGI-like protocols (e.g. FastCGI) to WSGI. It is ONLY for server implementers, in other words, and the typical app developer is doing something terribly wrong if they are even bothering to read its documentation. ;-)

3. The primary relevance of the native string type to an app developer is that when porting code from Python 2 to 3, they must still decode environment variable values, even though they are already Unicode. If their code was previously dealing only in Python 2 'str' objects, then nothing really changes. If they were previously decoding from environ str's to unicode, then they must replace their prior .decode('whatever') with .encode('latin1').decode('whatever'). That's basically it for porting from Python 2.

IOW, this design choice allows most HTTP header manipulating code (whether input or output) to be ported to Python 3 with a very mechanical change pattern. Most such code is working with ASCII anyway, since normally both input and output headers are, and there are few headers that an application would be likely to convert to actual unicode anyway.

On output via start_response(), if an application is currently encoding an output header -- why they would be, I have no idea, but if they are -- they need to add a re-encode to latin1 (i.e., .encode('whatever').decode('latin1')).

IOW, a short 2-to-3 porting guide for WSGI:

* If you just used strings for headers before, that part of your code doesn't change. (And if it was broken before, it's still broken in exactly the same way. No new breakage is introduced. ;-) )

* If you encoded any output headers or decoded any input headers, you must take into account the extra latin1 step. This is expected to be rare, since it's usually only SCRIPT_NAME and PATH_INFO that anybody would ever care about on input, and almost never anything on output.

* Values yielded by an application or sent via a write() call MUST be byte strings; the environ and start_response() values MUST be native strings. No mixing and matching.
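The porting rules above can be sketched as a tiny WSGI application. This is an illustration under stated assumptions, not code from the PEP or from wsgiref: the helper names `environ_to_text` and `text_to_header`, the `X-Requested-Path` header, and the choice of UTF-8 as the application's encoding are all hypothetical.

```python
def environ_to_text(value, encoding="utf-8"):
    """Input side: native (latin1-decoded) environ string -> real text.

    Mirrors the .encode('latin1').decode('whatever') step; the UTF-8
    default is an assumption of this sketch.
    """
    return value.encode("latin1").decode(encoding)


def text_to_header(value, encoding="utf-8"):
    """Output side: real text -> native string suitable for start_response().

    Mirrors the .encode('whatever').decode('latin1') step.
    """
    return value.encode(encoding).decode("latin1")


def app(environ, start_response):
    # PATH_INFO is one of the few input values an app is likely to decode.
    path = environ_to_text(environ.get("PATH_INFO", "/"))

    # Response bodies MUST be byte strings.
    body = ("You asked for " + path).encode("utf-8")

    # Status and headers MUST be native strings.
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("X-Requested-Path", text_to_header(path)),  # hypothetical header
    ]
    start_response("200 OK", headers)
    return [body]
```

An application that only ever handles ASCII headers would not need either helper, which matches the point that most header-manipulating code ports unchanged.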
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, 04 Jan 2011 03:44:53 +0100 Victor Stinner victor.stin...@haypocalc.com wrote:
> def wsgi_string(u):
>     # Convert an environment variable to a WSGI bytes-as-unicode string
>     return u.encode(enc, esc).decode('iso-8859-1')
>
> def run_with_cgi(application):
>     environ = {k: wsgi_string(v) for k,v in os.environ.items()}
>     environ['wsgi.input'] = sys.stdin
>     environ['wsgi.errors'] = sys.stderr
>     environ['wsgi.version'] = (1, 0)
>     ...
> --
>
> What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.
>
> If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.

+1. We should minimize such reencoding dances, and avoid promoting them.

Regards

Antoine.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tuesday 04 January 2011 at 13:20 +0100, Antoine Pitrou wrote:
> On Tue, 04 Jan 2011 03:44:53 +0100 Victor Stinner victor.stin...@haypocalc.com wrote:
> > def wsgi_string(u):
> >     # Convert an environment variable to a WSGI bytes-as-unicode string
> >     return u.encode(enc, esc).decode('iso-8859-1')
> >
> > def run_with_cgi(application):
> >     environ = {k: wsgi_string(v) for k,v in os.environ.items()}
> >     environ['wsgi.input'] = sys.stdin
> >     environ['wsgi.errors'] = sys.stderr
> >     environ['wsgi.version'] = (1, 0)
> >     ...
> > --
> >
> > What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.
> >
> > If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.
>
> +1. We should minimize such reencoding dances, and avoid promoting them.

The example from the PEP is specific to CGI and is a little bit special. The reference implementation (wsgiref in py3k) only re-decodes (transcodes) some variables:

---
_is_request = {
    'SCRIPT_NAME', 'PATH_INFO', 'QUERY_STRING', 'REQUEST_METHOD', 'AUTH_TYPE',
    'CONTENT_TYPE', 'CONTENT_LENGTH', 'HTTPS', 'REMOTE_USER', 'REMOTE_IDENT',
}.__contains__

def _needs_transcode(k):
    return _is_request(k) or k.startswith('HTTP_') or k.startswith('SSL_') \
        or (k.startswith('REDIRECT_') and _needs_transcode(k[9:]))
---

My problem is that I don't understand how I can know whether a variable was converted to bytes-as-unicode or not. Graham Dumpleton told me on IRC that the framework is supposed to re-decode some variables (e.g. PATH_INFO) one more time. But this is not explicit in the PEP, and _needs_transcode() is a private function.

Since the environ already contains different types (e.g. wsgi.version is a tuple, wsgi.multithread is a boolean, ...), why not keep these variables as raw bytes?

Victor
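The key-based rule quoted above can be turned into a small self-contained sketch that shows which variables end up as bytes-as-unicode. This is a simplification written for illustration: `needs_transcode` and `transcode_environ` are hypothetical public-style names, the UTF-8 source encoding is an assumption, and the real wsgiref code handles more platform cases than this does.

```python
# Request-derived CGI variables, copied from the set quoted in the message.
_REQUEST_KEYS = frozenset({
    'SCRIPT_NAME', 'PATH_INFO', 'QUERY_STRING', 'REQUEST_METHOD', 'AUTH_TYPE',
    'CONTENT_TYPE', 'CONTENT_LENGTH', 'HTTPS', 'REMOTE_USER', 'REMOTE_IDENT',
})


def needs_transcode(key):
    """Return True if the variable carries data that came from the HTTP request."""
    return (
        key in _REQUEST_KEYS
        or key.startswith('HTTP_')
        or key.startswith('SSL_')
        # REDIRECT_ is 9 characters long; check what it wraps.
        or (key.startswith('REDIRECT_') and needs_transcode(key[9:]))
    )


def transcode_environ(source, enc='utf-8', esc='surrogateescape'):
    """Copy a str -> str mapping, transcoding only request-derived values.

    enc is the encoding the values were originally decoded with; using
    'utf-8' here is an assumption made for this sketch.
    """
    result = {}
    for key, value in source.items():
        if needs_transcode(key):
            value = value.encode(enc, esc).decode('iso-8859-1')
        result[key] = value
    return result
```

Under this rule a variable such as `HOME` keeps its ordinary decoded value, while `PATH_INFO` or any `HTTP_*` variable is bytes-as-unicode; the variable name is the only signal, which is the ambiguity Victor is pointing at.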
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, 04 Jan 2011 14:33:37 +0100 Victor Stinner victor.stin...@haypocalc.com wrote:
> On Tuesday 04 January 2011 at 13:20 +0100, Antoine Pitrou wrote:
> > On Tue, 04 Jan 2011 03:44:53 +0100 Victor Stinner victor.stin...@haypocalc.com wrote:
> > > def wsgi_string(u):
> > >     # Convert an environment variable to a WSGI bytes-as-unicode string
> > >     return u.encode(enc, esc).decode('iso-8859-1')
> > >
> > > def run_with_cgi(application):
> > >     environ = {k: wsgi_string(v) for k,v in os.environ.items()}
> > >     environ['wsgi.input'] = sys.stdin
> > >     environ['wsgi.errors'] = sys.stderr
> > >     environ['wsgi.version'] = (1, 0)
> > >     ...
> > > --
> > >
> > > What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.
> > >
> > > If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.
> >
> > +1. We should minimize such reencoding dances, and avoid promoting them.
>
> The example from the PEP is specific to CGI and is a little bit special.

Well, it would be better if it used os.environb anyway ;)

Regards

Antoine.
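For comparison, a gateway built on os.environb as suggested here might look roughly like the sketch below. It is not what the PEP example or wsgiref does; `environ_from_bytes` is a hypothetical helper, and the sketch deliberately leaves out Windows, where os.environb does not exist.

```python
import os


def environ_from_bytes(env_bytes):
    """Decode a bytes -> bytes mapping into native strings via ISO-8859-1.

    Every byte maps to exactly one code point, so nothing is lost and no
    locale-dependent guess is involved.
    """
    return {
        key.decode('iso-8859-1'): value.decode('iso-8859-1')
        for key, value in env_bytes.items()
    }


# os.supports_bytes_environ is False on Windows, where os.environb is absent.
if os.supports_bytes_environ:
    environ = environ_from_bytes(os.environb)
else:
    environ = {}  # this sketch does not attempt the Windows case
```

This avoids the encode-then-decode dance on POSIX, but it does not answer And Clover's objection (b) about platforms whose native environment is Unicode.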
Re: [Python-Dev] PEP 3333: wsgi_string() function
On 01/03/2011 09:44 PM, Victor Stinner wrote:
> Hi,
>
> In the PEP, I read:
> --
> import os, sys
> enc, esc = sys.getfilesystemencoding(), 'surrogateescape'
>
> def wsgi_string(u):
>     # Convert an environment variable to a WSGI bytes-as-unicode string
>     return u.encode(enc, esc).decode('iso-8859-1')
>
> def run_with_cgi(application):
>     environ = {k: wsgi_string(v) for k,v in os.environ.items()}
>     environ['wsgi.input'] = sys.stdin
>     environ['wsgi.errors'] = sys.stderr
>     environ['wsgi.version'] = (1, 0)
>     ...
> --
>
> What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.
>
> If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.

I'm not clear on the semantics here, but I'm pretty sure you'll find that the web-SIG does know them well. I have CC'ed that list (via gmane).

Note that Guido just recently wrote on that list that he considers that PEP to be de facto accepted.

Tres.
--
Tres Seaver +1 540-429-0999 tsea...@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] PEP 3333: wsgi_string() function
At 03:44 AM 1/4/2011 +0100, Victor Stinner wrote:
> Hi,
>
> In the PEP, I read:
> --
> import os, sys
> enc, esc = sys.getfilesystemencoding(), 'surrogateescape'
>
> def wsgi_string(u):
>     # Convert an environment variable to a WSGI bytes-as-unicode string
>     return u.encode(enc, esc).decode('iso-8859-1')
>
> def run_with_cgi(application):
>     environ = {k: wsgi_string(v) for k,v in os.environ.items()}
>     environ['wsgi.input'] = sys.stdin
>     environ['wsgi.errors'] = sys.stderr
>     environ['wsgi.version'] = (1, 0)
>     ...
> --
>
> What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.
>
> If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.

If you want to discuss this, the Web-SIG is the appropriate place. Also, it was the appropriate place months ago, when the final decision on the environ encoding was made. ;-)

IOW, the above change to the PEP is merely fixing the code example to be correct for Python 3, where it previously was correct only for Python 2. The PEP itself has already required this since the previous revisions, and wsgiref in the stdlib is already compliant with the above (although it uses a more sophisticated approach for dealing with win32 compatibility).

The rationale for this choice is described in the PEP, and was also discussed in the mailing list emails back when the work was being done.

IOW, this particular ship already sailed a long time ago. In fact, for Jython this bytes-as-unicode approach has been the PEP 333-defined encoding for at least *six years*... so it's REALLY late to complain about it now! ;-)

The PEP is merely a mapping of PEP 333 to allow WSGI apps to be ported from Python 2 to Python 3. There is work in progress on the Web-SIG now on PEP 444, which will support only Python 2.6+, where 'b' literals and the 'bytes' alias are available. It is as yet uncertain what environ encoding will be used, but at the moment I'm not convinced that either pure bytes or pure unicode are acceptable replacements for the PEP 333-compatible approach.

In any event, that is a discussion for the Web-SIG, not Python-Dev.
Re: [Python-Dev] PEP 3333: wsgi_string() function
On Tue, Jan 4, 2011 at 8:22 AM, Tres Seaver tsea...@palladion.com wrote:
> Note that Guido just recently wrote on that list that he considers that PEP to be de facto accepted.

That was conditional on there not being any objections in the next 24 hours. There have been plenty, so I'm retracting that.

--
--Guido van Rossum (python.org/~guido)
[Python-Dev] PEP 3333: wsgi_string() function
Hi,

In the PEP, I read:
--
import os, sys
enc, esc = sys.getfilesystemencoding(), 'surrogateescape'

def wsgi_string(u):
    # Convert an environment variable to a WSGI bytes-as-unicode string
    return u.encode(enc, esc).decode('iso-8859-1')

def run_with_cgi(application):
    environ = {k: wsgi_string(v) for k,v in os.environ.items()}
    environ['wsgi.input'] = sys.stdin
    environ['wsgi.errors'] = sys.stderr
    environ['wsgi.version'] = (1, 0)
    ...
--

What is this horrible encoding bytes-as-unicode? os.environ is supposed to be correctly decoded and contain valid unicode characters. If WSGI uses another encoding than the locale encoding (which is a bad idea), it should use os.environb and decode keys and values using its own encoding.

If you really want to store bytes in unicode, str is not the right type: use the bytes type and use os.environb instead.

Victor
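To see what the quoted wsgi_string() actually does to a value, here is a small worked example. It is a sketch under assumptions: the encoding is passed in explicitly as 'utf-8' rather than taken from sys.getfilesystemencoding(), so that the result does not depend on the machine it runs on, and the sample bytes are made up.

```python
def wsgi_string(u, enc='utf-8', esc='surrogateescape'):
    # Undo the decoding that produced u, then re-decode the same bytes as
    # ISO-8859-1. enc='utf-8' stands in for the filesystem encoding here.
    return u.encode(enc, esc).decode('iso-8859-1')


# Bytes that are NOT valid UTF-8: "/café" as a cp1252/latin1 client would send it.
raw = b'/caf\xe9'

# What os.environ would hold under a UTF-8 locale: the undecodable byte
# 0xE9 becomes the lone surrogate U+DCE9 thanks to surrogateescape.
as_os_environ = raw.decode('utf-8', 'surrogateescape')

# The WSGI native string: every original byte is now one code point.
native = wsgi_string(as_os_environ)

# The application can always get the exact request bytes back.
back_to_bytes = native.encode('iso-8859-1')
```

The surrogateescape error handler is what makes the first step reversible; without it, the invalid byte would already have been lost or raised an error before the gateway could pass it on.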