New submission from Bill Winslow:

The following code will produce a UnicodeEncodeError about a character being non-ascii:

from urllib import request, parse, error

url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
req = request.Request(url)
response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment:

# Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

print(request)
self._output(request.encode('ascii'))

This prints the following:

>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last):
...

I confirmed that the 27th character as mentioned in the traceback is in fact the á in the last name.

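For anyone hitting the same traceback, one possible caller-side workaround is to percent-encode the URL before handing it to urllib. The sketch below is only an illustration, not a proposed fix for the library: percent_encode_url is a hypothetical helper name, it assumes the server expects UTF-8 percent-encoding, and it does not handle non-ASCII host names (those would need IDNA encoding).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url):
    """Hypothetical helper: percent-encode path, query and fragment as UTF-8.

    '%' is kept in the safe set so that already-encoded sequences
    are not encoded a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&+%")
    fragment = quote(parts.fragment, safe="%")
    # The network location (host) is passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
safe_url = percent_encode_url(url)
print(safe_url)
```

The resulting string is pure ASCII, so request.Request(safe_url) followed by request.urlopen(...) should no longer trip over request.encode('ascii') in http/client.py.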
Clearly either urllib or http is not properly sanitizing the URL. Unfortunately, I don't know the library well enough to determine where the actual error is; hopefully this report contains enough detail to make that easy.

----------
components: Library (Lib), Unicode
messages: 210587
nosy: Dubslow, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: urllib/http fail to sanitize a non-ascii url
type: behavior
versions: Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20559>
_______________________________________