I was assuming urllib.quote/unquote would only be called on text intended for the non-hostname portions of URIs. I'm not sure whether that is the actual intent of urllib.quote; perhaps the documentation should be updated to specify precisely what it does, so that people can decide which parts of a URI it is appropriate to quote/unquote. I don't believe quote/unquote does anything sensible today with hostnames that contain characters outside printable ASCII, so this is no loss of existing functionality.

Re your suggestion that IRIs should be a separate module: my thought is that urllib, out of the box, should just work with the way websites on the web actually behave today. Thus, we should make urllib do the UTF-8 encode/decode itself rather than make users switch between one module for some URLs and another library for others.
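
To make the idea concrete, here is a minimal sketch of "UTF-8 encode, then percent-encode". It is written against Python 3's urllib.parse (not the 2008 urllib under discussion), and the helper name quote_text is hypothetical, used only for illustration.

```python
from urllib.parse import quote


def quote_text(text, safe="/"):
    """Hypothetical helper: encode text as UTF-8, then percent-encode the bytes."""
    # Encoding explicitly makes the UTF-8 step visible; quote() accepts bytes.
    return quote(text.encode("utf-8"), safe=safe)


# Intended for path-like text only; hostnames are deliberately not handled here.
print(quote_text("/wiki/caf\u00e9"))
```

Each non-ASCII character becomes one percent-escape per UTF-8 byte, which matches the behavior described above for many websites.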

Re the specific issue of how urllib.unquote should work: perhaps there could be an optional second argument that specifies the character encoding to use when decoding escaped characters? I would propose that this parameter default to UTF-8, since that is what most websites seem to use, but if the author knew that the website they were using encoded its URLs in an ISO-8859 encoding, they could unquote using that encoding instead.
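
A rough sketch of that proposal, assuming Python 3's urllib.parse.unquote_to_bytes for the percent-decoding step. The function name unquote_text and its errors parameter are assumptions made for illustration, not an existing urllib signature from this thread.

```python
from urllib.parse import unquote_to_bytes


def unquote_text(s, encoding="utf-8", errors="replace"):
    """Hypothetical helper: percent-decode to bytes, then decode as text.

    `encoding` plays the role of the proposed optional second argument.
    """
    raw = unquote_to_bytes(s)            # "%C3%A9" -> b"\xc3\xa9"
    return raw.decode(encoding, errors)  # bytes -> text in the chosen encoding


# Default: assume the site escaped UTF-8 bytes.
print(unquote_text("caf%C3%A9"))
# Caller knows the site uses ISO-8859-1 instead.
print(unquote_text("caf%E9", encoding="iso-8859-1"))
```

Both calls yield the same text; only the caller's knowledge of the site's encoding differs.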

On May 7, 2008, at 3:10 PM, Martin v. Löwis wrote:

>> If this is indeed the case, it sounds perfectly legal (according to the
>> RFC) and perfectly practical (as required by numerous popular websites)
>> to have urllib.quote and urllib.quote_plus do an automatic UTF-8
>> encoding of unicode strings before percent encoding them.
>
> It's probably legal, but I don't understand why you think it's
> practical. The DNS lookup then will certainly fail, no?
>
> Regards,
> Martin

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev