Maybe I didn't understand the RFC quite right, but it seemed to leave hostname handling as a choice between IDNA-encoding the hostname and replacing the non-ASCII characters with dashes. I guess in practice IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() knows it's working on URIs, URLs, or arbitrary strings. From the documentation, it looks like it expects to work on just the path component of a URL. If so, it doesn't need to know what to do when the IRI contains a hostname.
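
To illustrate the path-only reading, here is a small sketch. It assumes Python 3, where the function lives at urllib.parse.quote() and differs in detail from the urllib.quote() discussed here (in particular, it UTF-8 encodes non-ASCII text by default). The example strings are made up.

```python
from urllib.parse import quote

# A path component: "/" is kept by default, everything else
# outside the unreserved set is percent-encoded.
path_only = quote("/r\u00e9sum\u00e9/my file.txt")
print(path_only)   # /r%C3%A9sum%C3%A9/my%20file.txt

# A whole URL: the ":" after the scheme gets escaped too, which
# suggests quote() is not meant to be fed a complete URL.
whole_url = quote("http://example.com/my file.txt")
print(whole_url)   # http%3A//example.com/my%20file.txt
```

So any hostname handling would have to happen somewhere other than quote(), at least as it behaves in this sketch.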

The other somewhat under-specified part of all this is how urllib.unquote() should work. If it sees non-ASCII octets after percent-decoding, should it try to decode them as UTF-8 and, if that fails, leave them as is?
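
One possible reading of that fallback, sketched under the assumption of Python 3's urllib.parse.unquote_to_bytes(). The helper name unquote_utf8_or_raw is hypothetical, and this is not how any existing unquote() is documented to behave.

```python
from urllib.parse import unquote_to_bytes

def unquote_utf8_or_raw(s):
    """Percent-decode s; return text if the octets are valid UTF-8,
    otherwise return the raw octets untouched (hypothetical helper)."""
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 (e.g. Latin-1 octets): leave the bytes as they are.
        return raw

print(unquote_utf8_or_raw("caf%C3%A9"))  # valid UTF-8, comes back as text
print(unquote_utf8_or_raw("caf%E9"))     # invalid UTF-8, comes back as bytes
```

The obvious downside of this approach is that the return type depends on the input, which is part of why the behavior seems worth specifying.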

On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

"Martin v. Löwis" wrote:
The proper way to implement this would be IRIs (RFC 3987),
in particular section 3.1. This is not as simple as just
encoding it as UTF-8, as you might have to apply IDNA to
the host part.

Code doing so just hasn't been contributed yet.

But if someone wanted to do so, it's pretty simple:

>>> u'www.\u212bngstr\xf6m.com'.encode("idna")
'www.xn--ngstrm-hua5l.com'
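
Putting the two pieces together, a rough sketch of an IRI-to-URI conversion might look like the following. It assumes Python 3 (urllib.parse, and encode("idna") returning bytes). The function name iri_to_uri and the choice of safe characters are my own assumptions rather than anything taken from the RFC text, and userinfo is ignored for brevity.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left alone when percent-encoding; "%" is included so that
# already-escaped sequences are not escaped twice. (Assumed set.)
SAFE = "/%:@!$&'()*+,;=~"

def iri_to_uri(iri):
    """Hypothetical helper: IDNA for the host, UTF-8 percent-encoding elsewhere."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    if host:
        # The idna codec handles each dot-separated label in turn.
        host = host.encode("idna").decode("ascii")
    netloc = host
    if parts.port is not None:
        netloc = "%s:%d" % (host, parts.port)
    # quote() UTF-8 encodes non-ASCII characters before escaping them.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

print(iri_to_uri("http://www.\u212bngstr\xf6m.com/caf\xe9"))
```

The hostname goes through IDNA while the path, query, and fragment get plain UTF-8 percent-encoding, which is the split the quoted message describes.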


Robert Brewer
