On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca <[EMAIL PROTECTED]> wrote:
>
>> Con: URI encoding does not encode characters.
>
> OK, for all the people who say URI encoding does not encode characters: yes
> it does. This is not an encoding for binary data, it's an encoding for
> character data, but it's unspecified how the strings map to octets before
> being percent-encoded. From RFC 3986, section 1.2.1:
>
>> Percent-encoded octets (Section 2.1) may be used within a URI to represent
>> characters outside the range of the US-ASCII coded character set if this
>> representation is allowed by the scheme or by the protocol element in which
>> the URI is referenced. Such a definition should specify the character
>> encoding used to map those characters to octets prior to being
>> percent-encoded for the URI.
>
> So the string->string proposal is actually correct behaviour. I'm all in
> favour of a bytes->string version as well, just not with the names "quote"
> and "unquote".
>
> I'll prepare a new patch shortly which has bytes->string and string->bytes
> versions of the functions as well. (quote will accept either type, while
> unquote will output a str, there will be a new function unquote_to_bytes
> which outputs a bytes - is everyone happy with that?)
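
For readers following the str/bytes distinction above, here is a minimal sketch of how the same string percent-encodes differently depending on the character encoding chosen. It uses the functions as they exist in today's Python 3 `urllib.parse`, which is an assumption on my part and may not match the patch under discussion at the time; the sample string is made up for illustration.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

# Hypothetical sample text containing one non-ASCII character (e with acute).
text = "caf\u00e9"

# str -> str: the character encoding decides which octets get percent-encoded.
utf8_quoted = quote(text, encoding="utf-8")
latin1_quoted = quote(text, encoding="latin-1")

# str -> bytes: no character encoding is needed, the raw octets come back.
raw_octets = unquote_to_bytes(utf8_quoted)

# str -> str: decoding the octets needs the same encoding that produced them.
round_trip = unquote(utf8_quoted, encoding="utf-8")

# bytes -> str: quoting raw octets directly, again with no character encoding.
requoted = quote_from_bytes(raw_octets)

print(utf8_quoted)    # caf%C3%A9
print(latin1_quoted)  # caf%E9
print(raw_octets)     # b'caf\xc3\xa9'
```

Note that in this sketch the `encoding` argument only selects the octet mapping; each function's return type stays fixed regardless of the argument.
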
I'd rather have two pairs of functions, so that those who want to give
the readers of their code a clue can do so. I'm not opposed to having
redundant functions that accept either string or bytes though, unless
others prefer not to.

> Guido says:
>>
>> Actually, we'd need to look at the various other APIs in Py3k before we
>> can decide whether these should be considered taking or returning bytes or
>> text. It looks like all other APIs in the Py3k version of urllib treat URLs
>> as text.
>
> Yes, as I said in the bug tracker, I've groveled over the entire stdlib to
> see how my patch affects the behaviour of dependent code. Aside from a few
> minor bits which assumed octets (and did their own encoding/decoding) (which
> I fixed), all the code assumes strings and is very happy to go on assuming
> this, as long as the URIs are encoded with UTF-8, which they almost
> certainly are.

Sorry, I have yet to look at the tracker (only so many minutes in a day...).

> Guido says:
>>
>> I think the only change is to remove the encoding arguments and ...
>
> You really want me to remove the encoding= named argument? And hard-code
> UTF-8 into these functions? It seems like we may as well have the optional
> encoding argument, as it does no harm and could be of significant benefit.
> I'll post a patch with the unquote_to_bytes function, but leave the encoding
> arguments in until this point is clarified.

I don't mind an encoding argument, as long as it isn't used to change
the return type (as Bill was proposing).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev