Re: [Python-Dev] bytes / unicode

Terry Reedy Sun, 20 Jun 2010 20:58:14 -0700

On 6/20/2010 9:33 PM, P.J. Eby wrote:

At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:

Do you have in mind any tools that could and should operate on both,
but do not?


From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :


Thanks for the concrete examples in this and your other post.
I am cc-ing the author of the above.

"""The problem which arises is that unquoting of URLs in Python 3.X
stdlib can only be done on unicode strings.

Actually, I believe this is an encoding issue rather than a bytes versus unicode issue.


If though a string contains non UTF-8 encoded characters it can fail."""

Which is to say, I believe, if the ascii text in the (unicode) string has a % encoding of a byte that is not a legal utf-8 encoding of anything.


The specific example is

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is a white ? in a dark diamond, indicating an error.
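To show where that error character comes from, here is a minimal sketch of the decoding step in isolation. It assumes only that %e0 stands for the single byte 0xE0; the variable names are mine, for illustration.

```python
# The %e0 escape stands for the single byte 0xE0.
raw = b'b\xe0'

# Strict UTF-8 decoding rejects it: 0xE0 opens a multi-byte sequence
# that is never completed.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors='replace', the bad byte becomes U+FFFD, the replacement
# character displayed as a question mark in a diamond.
replaced = raw.decode('utf-8', errors='replace')

# Latin-1 maps every byte to a character, so 0xE0 becomes U+00E0 (a-grave).
latin = raw.decode('latin-1')

print(strict_ok, ascii(replaced), ascii(latin))
```

So the same byte is either an error or a perfectly good character, depending only on which encoding is assumed.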

parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by their single-character equivalent.". unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

>>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding "encoding='latin-1'" to the unquote call.
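The same idea can be sketched without editing the stdlib file. The helper below, parse_qsl_with_encoding, is a hypothetical name of my own and not part of urllib.parse; it only splits on '&' and '=' and forwards the encoding to unquote, ignoring the ';' separators and '+' handling that the real function deals with.

```python
from urllib.parse import unquote


def parse_qsl_with_encoding(qs, encoding='utf-8'):
    """Hypothetical stand-in for parse_qsl that forwards an encoding.

    Simplified on purpose: splits on '&' and '=' only.
    """
    pairs = []
    for field in qs.split('&'):
        if not field:
            continue  # skip empty fields such as a trailing '&'
        name, _, value = field.partition('=')
        # unquote() decodes the %xx bytes with the given encoding;
        # its default error handling is errors='replace'.
        pairs.append((unquote(name, encoding=encoding),
                      unquote(value, encoding=encoding)))
    return pairs


print(parse_qsl_with_encoding('a=b%e0'))
print(parse_qsl_with_encoding('a=b%e0', encoding='latin-1'))
```

The first call reproduces the replacement character, the second the latin-1 result simulated above.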


Does this solve this problem?
Has anything like this been added for 3.2?
Should it be?

I don't have any direct experience with the specific issue demonstrated
in that post, but in the context of the discussion as a whole, I
understood the overall issue as "if you pass bytes to certain stdlib
functions, you might get back unicode, an explicit error, or (at least
in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the application of the new model, not the model itself.


Just for 'fun', I tried feeding bytes to the function.
>>> p.parse_qsl(b'a=b%e0')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    p.parse_qsl(b'a=b%e0')
  File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to split bytes with unicode is (now, at least) a mistake. This could be 'fixed' by replacing the typed literals with expressions that match the type of the input. But I am not sure if that is sensible since the next step is to unquote and decode to unicode anyway. I just do not know the use case.
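For what it is worth, here is a rough sketch of what "expressions that match the type of the input" might look like for the splitting line in the traceback. split_pairs is a hypothetical name, and this covers only the split, not the unquoting that follows.

```python
def split_pairs(qs):
    """Split a query string on '&' and ';' using separators of qs's own type."""
    if isinstance(qs, bytes):
        amp, semi = b'&', b';'  # bytes input needs bytes separators
    else:
        amp, semi = '&', ';'    # str input needs str separators
    return [s2 for s1 in qs.split(amp) for s2 in s1.split(semi)]


print(split_pairs('a=b%e0&c=d;e=f'))
print(split_pairs(b'a=b%e0&c=d;e=f'))
```

This avoids the TypeError, but the pieces for bytes input are still bytes, so the question of what to do at the unquote step remains.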


Terry Jan Reedy



