Re: [Python-Dev] urllib unicode handling

Tom Pinckney Wed, 07 May 2008 09:13:11 -0700

Maybe I didn't understand the RFC quite right, but it seemed like how to handle hostnames was left as a choice between IDNA-encoding the hostname or replacing the non-ASCII characters with dashes? I guess in practice IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() understands it's working on URIs, arbitrary strings, URLs or what. From the documentation, it looks like it's expecting to just work on the path component of URLs. If this is so, then it doesn't need to understand what to do if the IRI contains a hostname.
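
To make the "path component only" reading concrete, here is a minimal sketch. It assumes Python 3, where urllib.quote() lives at urllib.parse.quote(); the helper name quote_path is made up for illustration and is not part of urllib.

```python
from urllib.parse import quote


def quote_path(path, encoding="utf-8"):
    """Percent-encode a URL path (hypothetical helper, not a urllib API).

    "/" is left alone so path segments survive; non-ASCII characters are
    encoded to bytes with `encoding` and then percent-escaped.
    """
    return quote(path, safe="/", encoding=encoding)


# A path round-trips sensibly ...
print(quote_path("/caf\u00e9/men\u00fc"))    # /caf%C3%A9/men%C3%BC

# ... but a full URL does not: the ":" after the scheme gets escaped,
# which is why quote() is best treated as a path-only tool.
print(quote_path("http://example.com/"))     # http%3A//example.com/
```

The second call shows why the function cannot be expected to handle a hostname: it has no notion of URL structure at all.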

Seems like the other somewhat under-specified part of all of this is how urllib.unquote() should work. If after percent decoding it sees non-ASCII octets, should it try to decode them as UTF-8 and, if that fails, leave them as is?
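
One possible shape for that behavior, purely as a sketch: percent-decode to raw octets first, then try UTF-8 and fall back to the undecoded bytes. This assumes Python 3's urllib.parse.unquote_to_bytes(); the name unquote_lenient is hypothetical, and returning two different types is only one of several ways the fallback could be designed.

```python
from urllib.parse import unquote_to_bytes


def unquote_lenient(quoted):
    """Percent-decode, then decode as UTF-8 if possible (hypothetical helper).

    Returns str when the octets are valid UTF-8, otherwise the raw bytes
    exactly as they came out of percent decoding.
    """
    raw = unquote_to_bytes(quoted)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 (for example Latin-1 escapes): leave the octets as is.
        return raw


print(repr(unquote_lenient("caf%C3%A9")))   # valid UTF-8, comes back as text
print(repr(unquote_lenient("caf%E9")))      # lone 0xE9 is not UTF-8, stays bytes
```

The awkward part is visible in the return type: callers have to cope with both text and bytes, which is arguably the real reason the behavior is under-specified.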


On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

"Martin v. Löwis" wrote:

The proper way to implement this would be IRIs (RFC 3987),
in particular section 3.1. This is not as simple as just
encoding it as UTF-8, as you might have to apply IDNA to
the host part.

Code doing so just hasn't been contributed yet.


But if someone wanted to do so, it's pretty simple:

u'www.\u212bngstr\xf6m.com'.encode("idna")

'www.xn--ngstrm-hua5l.com'
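
Putting the two pieces together, a rough sketch of the RFC 3987 section 3.1 mapping could look like the following. It assumes Python 3 (urllib.parse), an IRI that has a host and no userinfo, and the function names are invented for illustration; it is not a complete or validated implementation of the RFC.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def _escape_non_ascii(text):
    # Percent-encode only non-ASCII characters (as UTF-8); ASCII is untouched.
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in text)


def iri_to_uri(iri):
    """Convert an IRI to a URI (hypothetical helper, simplified).

    Host: IDNA (ToASCII).  Path, query, fragment: UTF-8 + percent-encoding.
    Assumes the IRI has a host and carries no user:password@ part.
    """
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    return urlunsplit((
        parts.scheme,
        netloc,
        _escape_non_ascii(parts.path),
        _escape_non_ascii(parts.query),
        _escape_non_ascii(parts.fragment),
    ))


print(iri_to_uri("http://www.\u212bngstr\xf6m.com/caf\u00e9?q=\u00fc"))
```

The host goes through the same "idna" codec shown above, while everything after it is handled by plain UTF-8 percent-encoding, so the two encodings never mix.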


Robert Brewer


