[issue1712522] urllib.quote throws exception on Unicode URL

Matt Giuca Sun, 14 Mar 2010 00:46:45 -0800

Matt Giuca <matt.gi...@gmail.com> added the comment:

I've finally gotten around to a complete analysis of this code. I have a 
code/test/documentation patch which fixes the issue without any code breakage.


There is another bug in quote which I've found and fixed with this patch: If 
the 'safe' parameter is unicode, it raises a UnicodeDecodeError.

I have backported all of the 'quote' test cases from Python 3 (which I wrote) 
to Python 2. This exposed the reported bug as well as the one above. It's good 
to have a much larger set of test cases to work with. They cover all 
combinations of str/unicode, as well as non-ASCII byte string input and all 
manner of unicode inputs.

The bugfix itself comes from Python 3 (this has already been approved, over 
many months, by Guido, so I am hoping a similar change can get pushed through 
into Python 2 fairly easily). The solution is to add "encoding" and "errors" 
arguments to 'quote', and have quote encode the unicode string before anything 
else. 'encoding' defaults to 'utf-8'. So:

>>> quote(u'/El Niño/')
'/El%20Ni%C3%B1o/'

which is typically the desired behaviour. (Note that URI syntax does not cover 
Unicode strings; it merely says to encode them with some encoding (UTF-8 is 
recommended but not required) and then percent-encode the resulting bytes.)
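
To illustrate the encode-then-quote order, here is a minimal sketch written 
for Python 3. It is not the code from the patch: the name quote_sketch and the 
ALWAYS_SAFE set (letters, digits and "_.-") are assumptions made for this 
example only.

```python
# Hypothetical sketch of "encode first, then percent-encode the bytes".
# ALWAYS_SAFE is an assumed set of unreserved characters, not the patch's own.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-"
)

def quote_sketch(s, safe="/", encoding="utf-8", errors="strict"):
    # Step 1: turn text into bytes using the requested encoding.
    if isinstance(s, str):
        s = s.encode(encoding, errors)
    # Step 2: work out which byte values may pass through unquoted.
    keep = ALWAYS_SAFE | frozenset(safe.encode("ascii", "ignore"))
    # Step 3: percent-encode every other byte.
    return "".join(chr(b) if b in keep else "%{:02X}".format(b) for b in s)

print(quote_sketch("/El Niño/"))                      # /El%20Ni%C3%B1o/
print(quote_sketch("/El Niño/", encoding="latin-1"))  # /El%20Ni%F1o/
```

The second call shows why the encoding matters: the same character becomes 
different bytes, and therefore a different URI, under a different encoding.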

With this patch, quote *always* returns a str, even on unicode input. I think 
that makes sense, because a URI is, by definition, an ASCII string. It could 
easily be made to return a unicode instead.

The other fix is for 'safe'. If 'safe' is a byte string we don't touch it. But 
if it is a Unicode string, we throw away all non-ASCII characters. This means 
you can't make *characters* safe, only *bytes*, since URIs deal with bytes. In 
Python 3, we go further and throw away all non-ASCII bytes from a byte string 
'safe' as well, so you can only make ASCII bytes safe. For this patch, I didn't 
go that far, for backwards compatibility reasons.
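
The rule for 'safe' can be sketched as follows, in Python 3 syntax. The helper 
normalize_safe is hypothetical and only mirrors the behaviour described above 
for this patch; the calls to urllib.parse.quote show the stricter Python 3 
behaviour for comparison.

```python
from urllib.parse import quote  # Python 3 counterpart of urllib.quote

def normalize_safe(safe):
    """Hypothetical helper: return the byte values that 'safe' protects."""
    if isinstance(safe, str):
        # Unicode input: non-ASCII characters are dropped.
        return frozenset(safe.encode("ascii", "ignore"))
    # Byte string input: left untouched, as described for this patch.
    return frozenset(safe)

print(sorted(normalize_safe("/ñ")))      # [47]       only "/" survives
print(sorted(normalize_safe(b"/\xf1")))  # [47, 241]  both bytes are kept

# Python 3 drops non-ASCII from 'safe' in both cases, so ñ is still quoted.
print(quote("ñ", safe="ñ"))              # %C3%B1
print(quote(b"\xf1", safe=b"\xf1"))      # %F1
```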

Also updated documentation.

In summary, this patch makes 'quote' fully Unicode compliant. It does not 
change any existing behaviour which wouldn't previously have thrown an 
exception, so it can't possibly break any existing code (unless it's relying on 
the exception being thrown).

(A minor change I made was replacing the line "cachekey = (safe, always_safe)" 
with "cachekey = safe". This avoids unnecessary work of hashing always_safe and 
the tuple, since always_safe doesn't change. It doesn't affect the behaviour.)
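
As a rough illustration of the cache key change, the sketch below (Python 3 
syntax) keys a lookup-table cache on 'safe' alone. The names get_safe_map and 
_safe_maps, and the contents of ALWAYS_SAFE, are assumptions for this example 
and are not taken from the patch.

```python
# Hypothetical sketch: cache one quoting table per distinct 'safe' value.
ALWAYS_SAFE = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789_.-")
_safe_maps = {}  # cache: safe string -> table mapping each char to its quoted form

def get_safe_map(safe):
    # ALWAYS_SAFE is a constant, so including it in the key would add
    # hashing work without ever distinguishing two cache entries.
    cachekey = safe
    try:
        return _safe_maps[cachekey]
    except KeyError:
        keep = ALWAYS_SAFE + safe
        table = {}
        for i in range(256):
            c = chr(i)
            table[c] = c if (i < 128 and c in keep) else "%{:02X}".format(i)
        _safe_maps[cachekey] = table
        return table

print(get_safe_map("/")[" "])  # %20
print(get_safe_map("")["/"])   # %2F
```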

Note: I've also backported the 'unquote' test cases from Python 3 and found a 
few more bugs. I'm going to report them separately, with patches.

----------
keywords: +patch
Added file: http://bugs.python.org/file16539/urllib-quote.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1712522>
_______________________________________
