Looks like a guy on BayPiggies just set me straight. The short
answer is that THE RIGHT THING is for me to call .encode before
urlencode gets called. It would be inappropriate to put the .encode
in urlencode. Here's the long answer:
Forwarded Conversation
Subject: urllib.urlencode and encoding
------------------------
From: Shannon -jj Behrens <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Cc: David Loftesness <[EMAIL PROTECTED]>
Date: Wed, Apr 18, 2007 at 4:25 PM
I noticed that urllib.urlencode does the right thing (i.e. it uses
%xx) if you .encode('utf-8') the parameters first. I'm wondering if
it makes sense for urllib.urlencode to automatically encode Unicode
objects in this case. I haven't had much luck getting changes into
Python, so I was going to solicit comments here first.
Thanks,
-jj
--
"'Software Engineering' is something of an oxymoron. It's very
difficult to have real engineering before you have physics, and there
isn't anything even close to a physics for software." -- L. Peter
Deutsch
--------
From: Keith Dart ♂ <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Wed, Apr 18, 2007 at 4:35 PM
Shannon -jj Behrens wrote the following on 2007-04-18 at 16:25 PDT:
===
> I noticed that urllib.urlencode does the right thing (i.e. it uses
> %xx) if you .encode('utf-8') the parameters first. I'm wondering if
> it makes sense for urllib.urlencode to automatically encode Unicode
> objects in this case. I haven't had much luck getting changes into
> Python, so I was going to solicit comments here first.
===
Yes, I think so. That's why I use the module urlparseplus.
http://pynms.googlecode.com/svn/trunk/lib/urlparseplus.py
--
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Keith Dart <[EMAIL PROTECTED]>
public key: ID: 19017044
<http://www.dartworks.biz/>
=====================================================================
_______________________________________________
Baypiggies mailing list
[EMAIL PROTECTED]
To change your subscription options or unsubscribe:
http://mail.python.org/mailman/listinfo/baypiggies
--------
From: Tung Wai Yip <[EMAIL PROTECTED]>
To: Shannon -jj Behrens <[EMAIL PROTECTED]>, "[EMAIL PROTECTED]"
<[EMAIL PROTECTED]>
Cc: David Loftesness <[EMAIL PROTECTED]>
Date: Wed, Apr 18, 2007 at 4:51 PM
I may not have the complete context of your question, so I might be
suggesting something different.
I think you want to encode Unicode characters into a query string or the
URI. What you are doing is right. Not only do you have to encode a string
in UTF-8 first, you also need a complementary UTF-8 decoding on the CGI
side.
urllib.urlencode() cannot encode a Unicode string itself. RFC 2396 has not
taken Unicode into consideration, so there is no rule on what to do with
Unicode in a URI. It is up to the application to decide on the encoding,
e.g. UTF-8 first, URL encoding next. Others might very well choose to use
UTF-16 instead.
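The "encode on one side, decode with the same charset on the other" point can be sketched roughly as follows. This is only an illustration and assumes Python 3, where these functions live in urllib.parse rather than in the urllib module discussed in this thread; the parameter name "name" is made up for the example.

```python
from urllib.parse import urlencode, parse_qs

# Sending side: encode the text to UTF-8 octets first, then URL-encode.
sent = urlencode({"name": "\u30a2".encode("utf-8")})  # KATAKANA LETTER A

# Receiving side: decode the %xx octets with the same charset.
good = parse_qs(sent, encoding="utf-8")

# Decoding with a different charset does not fail; it yields wrong text.
bad = parse_qs(sent, encoding="latin-1")

print(sent)
print(good["name"][0] == "\u30a2")
print(bad["name"][0] == "\u30a2")
```

Nothing in the query string itself says which charset was used, so both sides have to agree on it out of band.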
Wai Yip
--------
From: Keith Dart ♂ <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Wed, Apr 18, 2007 at 5:08 PM
Tung Wai Yip wrote the following on 2007-04-18 at 16:51 PDT:
===
> urllib.urlencode() cannot encode unicode string itself. RFC 2396 has not
> taken unicode into consideration. So there is no rule on what to do with
> unicode in an URI. It is up to the application to decide on the encoding,
> e.g. UTF-8 first, url encoding next. Others might very well choose to use
> UTF-16 instead.
===
Nope, see RFC 3986:
Network Working Group T. Berners-Lee
Request for Comments: 3986 W3C/MIT
STD: 66 R. Fielding
Updates: 1738 Day Software
Obsoletes: 2732, *2396*, 1808
Section 2.5:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded. For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
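The three examples in that section can be checked with a short sketch. It assumes Python 3's urllib.parse (not the Python 2 urllib under discussion), and percent_encode_utf8 is a hypothetical helper name used only for this illustration.

```python
from urllib.parse import quote

def percent_encode_utf8(text):
    """Encode text as UTF-8 octets, then percent-encode every octet
    that is not an unreserved character."""
    return quote(text.encode("utf-8"), safe="")

# Characters named in RFC 3986, section 2.5, with the forms it gives.
examples = {
    "A": "A",               # unreserved, left as-is
    "\u00c0": "%C3%80",     # LATIN CAPITAL LETTER A WITH GRAVE
    "\u30a2": "%E3%82%A2",  # KATAKANA LETTER A
}

for char, rfc_form in examples.items():
    print(repr(char), percent_encode_utf8(char), rfc_form)
```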
--------
From: David Reid <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Date: Wed, Apr 18, 2007 at 9:15 PM
Hi folks,
On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
> > When a new URI scheme defines a component that represents textual
> > data consisting of characters from the Universal Character Set [UCS],
> > the data should first be encoded as octets according to the UTF-8
> > character encoding [STD63]; then only those octets that do not
> > correspond to characters in the unreserved set should be percent-
> > encoded. For example, the character A would be represented as "A",
> > the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
> > as "%C3%80", and the character KATAKANA LETTER A would be represented
> > as "%E3%82%A2".
The key piece of information here is "When a new URI scheme": the RFC
(AFAICT) makes no mention of what to do about old schemes, such as
HTTP. In fact, the HTML4 spec makes its own claims as to what is
%-encoded as a result of form submission:
http://www.w3.org/TR/html4/interact/forms.html
accept-charset = charset list [CI]
This attribute specifies the list of character encodings for
input data that is accepted by the server processing this form. The
value is a space- and/or comma-delimited list of charset values. The
client must interpret this list as an exclusive-or list, i.e., the
server is able to accept any single character encoding per entity
received.
The default value for this attribute is the reserved string
"UNKNOWN". User agents may interpret this value as the character
encoding that was used to transmit the document containing this FORM
element.
So I think it's still incorrect for urllib to make any such
assumptions as to the data being UTF-8. (Though I hope it won't be in
the future.)
-David
http://dreid.org
"Usually the protocol is this: I appoint someone for a task,
which they are not qualified to do. Then, they have to fight
a bear if they don't want to do it." -- Glyph Lefkowitz
--------
From: David Reid <[EMAIL PROTECTED]>
To: Keith Dart <[EMAIL PROTECTED]>
Cc: Python <[EMAIL PROTECTED]>
Date: Thu, Apr 19, 2007 at 9:14 AM
On Apr 18, 2007, at 11:17 PM, Keith Dart wrote:
> On Wed, 18 Apr 2007 21:15:34 -0700
> David Reid <[EMAIL PROTECTED]> wrote:
>
>> So I think it's still incorrect for urllib to make any such
>> assumptions as to the data being UTF-8. (Though I hope it won't be in
>> the future.)
>
> The RFC, and the previous discussion, have nothing to do with the
> content (data) encoding. It's only concerned with the URL encoding.
The relevant section of the HTML4 forms spec is concerned with the
URL encoding if the URL is generated by the browser as part of a form
submission. So I'm still gonna have to go with it being pretty much
completely wrong for urllib to make any assumptions about the charset
of %-encoded data (either in a URL segment or in query args). Not
that life wouldn't be much nicer if everything were UTF-8, but the
world isn't that nice to begin with.
-David
http://dreid.org
"Usually the protocol is this: I appoint someone for a task,
which they are not qualified to do. Then, they have to fight
a bear if they don't want to do it." -- Glyph Lefkowitz
--------
From: Tung Wai Yip <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Thu, Apr 19, 2007 at 10:38 AM
> On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
>> > When a new URI scheme defines a component that represents textual
>> > data consisting of characters from the Universal Character Set [UCS],
>> > the data should first be encoded as octets according to the UTF-8
>> > character encoding [STD63]; then only those octets that do not
>> > correspond to characters in the unreserved set should be percent-
>> > encoded. For example, the character A would be represented as "A",
>> > the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> > as "%C3%80", and the character KATAKANA LETTER A would be represented
>> > as "%E3%82%A2".
Thanks, Keith, for the heads up. One issue I regularly have is tracking down
the lineage of RFCs. When I find RFC X, I am often not aware of an RFC Y
that supersedes it. It doesn't help that historically there are many
documents pointing to RFC X, while RFC X itself has no link to RFC
Y. Try following the link from the bottom of the urlparse module
documentation. It does not lead to RFC 3986.
http://docs.python.org/lib/module-urlparse.html
On Wed, 18 Apr 2007 21:15:34 -0700, David Reid <[EMAIL PROTECTED]> wrote:
> The key piece of information here is "When a new URI scheme": the RFC
> (AFAICT) makes no mention of what to do about old schemes, such as
> HTTP. In fact, the HTML4 spec makes its own claims as to what is
> %-encoded as a result of form submission:
>
> http://www.w3.org/TR/html4/interact/forms.html
>
> accept-charset = charset list [CI]
> This attribute specifies the list of character encodings for
> input data that is accepted by the server processing this form. The
> value is a space- and/or comma-delimited list of charset values. The
> client must interpret this list as an exclusive-or list, i.e., the
> server is able to accept any single character encoding per entity
> received.
> The default value for this attribute is the reserved string
> "UNKNOWN". User agents may interpret this value as the character
> encoding that was used to transmit the document containing this FORM
> element.
>
> So I think it's still incorrect for urllib to make any such
> assumptions as to the data being UTF-8. (Though I hope it won't be in
> the future.)
>
> -David
> http://dreid.org
I think RFC 3986 says a character should be encoded in UTF-8 only if it is
from the UCS. But it is also legitimate to use other character sets, for
example as in the HTML4 spec David has pointed out. Say you are writing a
screen scraper for a Japanese website: you should use the character
encoding the website expects, which is not necessarily UTF-8.
Wai Yip
--------
From: Tung Wai Yip <[EMAIL PROTECTED]>
To: David Reid <[EMAIL PROTECTED]>, Keith Dart <[EMAIL PROTECTED]>
Cc: Python <[EMAIL PROTECTED]>
Date: Thu, Apr 19, 2007 at 11:14 AM
> On Apr 18, 2007, at 11:17 PM, Keith Dart wrote:
>
>> On Wed, 18 Apr 2007 21:15:34 -0700
>> David Reid <[EMAIL PROTECTED]> wrote:
>>
>>> So I think it's still incorrect for urllib to make any such
>>> assumptions as to the data being UTF-8. (Though I hope it won't be in
>>> the future.)
>>
>> The RFC, and the previous discussion, have nothing to do with the
>> content (data) encoding. It's only concerned with the URL encoding.
>
> The relevant section of the HTML4 forms spec is concerned with the
> URL encoding if the URL is generated by the browser as part of a form
> submission. So I'm still gonna have to go with it being pretty much
> completely wrong for urllib to make any assumptions about the charset
> of %-encoded data (either in a URL segment or in query args). Not
> that life wouldn't be much nicer if everything were UTF-8, but the
> world isn't that nice to begin with.
>
> -David
> http://dreid.org
Here is an example. The key parameter is BIG-5 encoded. Welcome to the
tower of babel!
http://search.books.com.tw/exep/prod_search.php?cat=all&key=%A5i%B7R%A4O%B6q%A4j&image233223.x=13&image233223.y=10
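A rough way to see what is going on in that URL, assuming Python 3's urllib.parse and its built-in "big5" codec: undo the %xx escaping to get raw octets, interpret them as Big5, and compare with what the same text looks like when UTF-8 is used instead.

```python
from urllib.parse import quote, unquote_to_bytes

# Value of the key parameter, copied from the URL above.
key = "%A5i%B7R%A4O%B6q%A4j"

raw = unquote_to_bytes(key)   # undo the %xx escaping, giving raw octets
text = raw.decode("big5")     # interpret the octets as Big5 text

# Percent-encode the same text under two different charsets.
big5_form = quote(text, safe="", encoding="big5")
utf8_form = quote(text, safe="", encoding="utf-8")

print(len(text), "characters")
print("Big5: ", big5_form)
print("UTF-8:", utf8_form)
```

The two percent-encoded forms differ, so a server expecting Big5 would not understand the UTF-8 one, and vice versa.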
Wai Yip
--------
From: Shannon -jj Behrens <[EMAIL PROTECTED]>
To: Tung Wai Yip <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED], David Loftesness <[EMAIL PROTECTED]>
Date: Thu, Apr 19, 2007 at 11:22 AM
Ok, thanks for all your comments, guys. David, thanks for the RFC
quotes. If I am to understand things correctly, because the rest of
my page is all working correctly using UTF-8, I can .encode('UTF-8')
parameters before passing them to urlencode. However, it doesn't make
sense to put that .encode inside urlencode.
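A minimal sketch of that conclusion, assuming Python 3's urllib.parse.urlencode (which accepts bytes values) rather than the Python 2 urllib.urlencode I was actually using. encode_params is a hypothetical helper, not part of any library; the caller, not urlencode, picks the charset.

```python
from urllib.parse import urlencode

def encode_params(params, charset="utf-8"):
    """Hypothetical helper: encode text values to bytes in the charset
    the receiving application expects, before URL-encoding."""
    return {key: value.encode(charset) for key, value in params.items()}

params = {"q": "\u00c0"}  # LATIN CAPITAL LETTER A WITH GRAVE

# Same text, two different charsets, two different query strings.
utf8_query = urlencode(encode_params(params, "utf-8"))
latin1_query = urlencode(encode_params(params, "latin-1"))

print(utf8_query)
print(latin1_query)
```

Since the right output depends on what the other end expects, the choice stays in the calling code.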
>Welcome to the tower of babel!
I was reading <http://www.mozilla.org/docs/web-developer/faq.html#accept>
the other day, and I was pondering the fact that we can't even agree
on versions of HTML. Mozilla *still* recommends HTML 4.01 over XHTML.
Since HTML is a language used to transport content, I recognized that
this too was a case of the Tower of Babel. Upon realizing this, in my
head, I heard a little voice say, "Gotcha!"
*sigh*
--------
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"pylons-discuss" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/pylons-discuss?hl=en
-~----------~----~----~----~------~----~------~--~---