Hi folks,

This issue got some attention a few weeks back but it seems to have fallen quiet, and I haven't had a good chance to sit down and reply again till now.
As I've said before, this is a serious issue which will affect a great deal of code. However, it's obviously not as clear-cut as I originally believed, since there are lots of conflicting opinions. Let us see if we can come to a consensus.

(For those who haven't seen the discussion, the thread starts here:
http://mail.python.org/pipermail/python-dev/2008-July/081013.html
continues here for some reason:
http://mail.python.org/pipermail/python-dev/2008-July/081066.html
and I've got a bug report with a fully tested and documented patch here:
http://bugs.python.org/issue3300)

Firstly, it looks like most people agree we should add an optional "encoding" argument which lets the caller customize which encoding to use. What we tend to disagree about is what the default encoding should be.

Here I present the various options as I see them (and I'm trying to be impartial), and the people who've indicated support for each option (apologies if I've misrepresented anybody's opinion, feel free to correct):

1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to UTF-8. unquote is Latin-1.
   In favour: Anybody who doesn't reply to this thread
   Pros: Already implemented; some existing code depends upon ord values of string being the same as they were for byte strings; possible to hack around it.
   Cons: unquote is not inverse of quote; quote behaviour internally inconsistent; garbage when unquoting UTF-8-encoded URIs.

2. Default to UTF-8.
   In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
   Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all major browsers; supports all characters; most existing code compatible by default; unquote is inverse of quote.
   Cons: By default, URIs may have invalid octet sequences (not possible to reverse).

3. quote default to UTF-8, unquote default to Latin-1.
   In favour: André Malo
   Pros: quote able to handle all characters; unquote able to handle all sequences.
   Cons: unquote is not inverse of quote; totally inconsistent.

4. quote accepts either bytes or str, unquote default to outputting bytes unless given an encoding argument.
   In favour: Bill Janssen
   Pros: Technically does what the spec says, which is treat it as an octet encoding.
   Cons: unquote will break most existing code; almost 100% of the time people will want it as a string.

</impartiality>

I'll just comment on #4 since I haven't already. Let's talk about quote and unquote separately.

For quote, I'm all for letting it accept a bytes as well as a str. That doesn't break anything or surprise anyone.

For unquote, I think it will break a lot and surprise everyone. While this may be "purely" the best option, I think it's pretty silly. I reckon the vast majority of users will be surprised when they see it spitting out a bytes object, and all that most people will do is decode it as UTF-8.

Besides, while you're reading the RFCs as "URLs specify a method for encoding octet sequences", I'm reading them as "URLs specify a method for encoding strings, and leave the character encoding unspecified." The second reading supports the idea that unquote outputs a str.

I'm also recommending we add unquote_to_bytes to do what you suggest unquote should do. (So either way we'll get both versions of unquote; I'm just suggesting the one called "unquote" do the thing everybody expects.) But that's less of a priority, so I want to commit these urgent fixes first.

I'm basically saying just two things:

1. The standards are undefined;
2. Therefore we should pick the most useful and/or intuitive default.

IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be more so in the future when more technologies are hard-coded as UTF-8 (which RFC 3986 recommends for future schemes). I am also quite adamant that unquote be the inverse of quote.
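To make the trade-offs between the options concrete, here is a minimal sketch. It assumes the API takes the shape found in urllib.parse in current Python 3 releases (quote and unquote with an "encoding" argument, plus unquote_to_bytes), which is not necessarily identical to any particular revision of the patch; the sample string "café" is just an illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

text = "caf\u00e9"  # "café": one character outside ASCII

# Option 2: both directions use UTF-8, so unquote inverts quote.
quoted = quote(text, encoding="utf-8")
roundtrip = unquote(quoted, encoding="utf-8")

# Option 3: quote as UTF-8 but unquote as Latin-1. Each UTF-8 octet is
# decoded as its own character, so the original string is not recovered.
mismatched = unquote(quoted, encoding="latin-1")

# Option 4's view: treat the URI as percent-encoded octets, no decoding.
raw = unquote_to_bytes(quoted)

# quote also accepts bytes, in which case no character encoding is applied.
quoted_from_bytes = quote(raw)

# The con listed under option 2: an octet sequence that is not valid UTF-8
# cannot be reversed; with errors="replace" it becomes U+FFFD.
invalid = unquote("%FF", encoding="utf-8", errors="replace")

print(quoted, roundtrip, mismatched, raw, quoted_from_bytes, repr(invalid))
```

Only the UTF-8/UTF-8 pairing gives back the original string; the Latin-1 unquote never fails, but it returns something other than what was quoted.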
Are there any more opinions on this matter? It would be good to reach a consensus. If anyone seriously wants to push a different alternative to mine, please write a working implementation and attach it to issue 3300.

On the technical side of things, does anybody have time to review my patch for this issue?
http://bugs.python.org/issue3300
Patch 5.
It's just a patch for unquote, quote, and small related functions, as well as numerous changes to test cases and documentation.

Cheers
Matt