On 6/20/2010 9:33 PM, P.J. Eby wrote:
At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:
Do you have in mind any tools that could and should operate on both,
but do not?

 From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

Thanks for the concrete examples in this and your other post.
I am cc-ing the author of the above.

"""The problem which arises is that unquoting of URLs in Python 3.X
stdlib can only be done on unicode strings.

Actually, I believe this is an encoding rather than bytes versus unicode issue.

> If though a string
contains non UTF-8 encoded characters it can fail."""

Which is to say, I believe, that it can fail if the ASCII text in the (unicode) string has a %-encoding of a byte that is not a legal UTF-8 encoding of anything.

The specific example is

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is a white ? in a dark diamond (U+FFFD, the replacement character), indicating a decoding error.
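The replacement character can be reproduced without urllib at all. This is only a sketch of the decoding step, assuming that 'b%e0' stands for the two bytes b'b\xe0'; the variable names are mine.

```python
# The two bytes that the percent-encoded text 'b%e0' stands for.
raw = b'b\xe0'

# 0xE0 on its own is an incomplete UTF-8 sequence, so strict decoding fails.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors='replace' the bad byte becomes U+FFFD, the diamond character.
replaced = raw.decode('utf-8', errors='replace')

# Latin-1 maps every byte to a code point, so the same bytes decode cleanly.
as_latin1 = raw.decode('latin-1')

print(strict_ok, ascii(replaced), ascii(as_latin1))
```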

parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by their single-character equivalent." unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

>>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding "encoding='latin-1'" to the unquote call.
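A minimal sketch of the idea, not the stdlib code: parse_qsl_with_encoding is a hypothetical helper of my own that splits the query string roughly the way parse_qsl does and forwards encoding and errors to urllib.parse.unquote, which does accept both.

```python
from urllib.parse import unquote


def parse_qsl_with_encoding(qs, encoding='utf-8', errors='replace'):
    """Hypothetical stand-in for parse_qsl that passes encoding to unquote."""
    # Same pair splitting as the line shown in the traceback: '&' and ';'.
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
    result = []
    for pair in pairs:
        if not pair:
            continue  # skip empty fields such as a trailing '&'
        name, _, value = pair.partition('=')
        # '+' means space in query strings; then undo the %xx escapes.
        name = unquote(name.replace('+', ' '), encoding=encoding, errors=errors)
        value = unquote(value.replace('+', ' '), encoding=encoding, errors=errors)
        result.append((name, value))
    return result


default_result = parse_qsl_with_encoding('a=b%e0')
latin1_result = parse_qsl_with_encoding('a=b%e0', encoding='latin-1')
print(ascii(default_result), ascii(latin1_result))
```

With the default encoding the value ends in U+FFFD; with encoding='latin-1' it ends in 'à' (U+00E0), matching the simulated interaction.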

Does this solve this problem?
Has anything like this been added for 3.2?
Should it be?

I don't have any direct experience with the specific issue demonstrated
in that post, but in the context of the discussion as a whole, I
understood the overall issue as "if you pass bytes to certain stdlib
functions, you might get back unicode, an explicit error, or (at least
in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the application of the new model, not the model itself.

Just for 'fun', I tried feeding bytes to the function.
>>> p.parse_qsl(b'a=b%e0')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    p.parse_qsl(b'a=b%e0')
  File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to split bytes with unicode is (now, at least) a mistake. This could be 'fixed' by replacing the typed literals with expressions that match the type of the input. But I am not sure if that is sensible since the next step is to unquote and decode to unicode anyway. I just do not know the use case.
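For what it is worth, "expressions that match the type of the input" could look something like the sketch below. split_pairs is a hypothetical helper, not anything in urllib.parse, and it covers only the splitting step, not the unquoting that follows.

```python
def split_pairs(qs):
    """Hypothetical helper: split a query string given as str or bytes."""
    # Choose separators of the same type as the input.
    if isinstance(qs, bytes):
        amp, semi = b'&', b';'
    else:
        amp, semi = '&', ';'
    return [s2 for s1 in qs.split(amp) for s2 in s1.split(semi)]


# Splitting bytes with a str separator is the mismatch behind the traceback.
try:
    b'a=b%e0'.split('&')
    mismatch_raises = False
except TypeError:
    mismatch_raises = True

print(split_pairs('a=1&b=2;c=3'), split_pairs(b'a=1&b=2;c=3'), mismatch_raises)
```

This only removes the TypeError; the bytes pieces would still have to be unquoted and decoded somewhere, which is why I am unsure it is the sensible fix.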

Terry Jan Reedy




_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev