Hi Python devs,

I am trying to download an HTML document. The server responds with an HTTP 301 (Moved Permanently) whose Location header is UTF-8 encoded, but http.client decodes it as iso-8859-1. As a result, whenever the redirect URL contains a non-ASCII character, I can't download the document.
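To make the mismatch concrete, here is a minimal sketch of what happens to a single non-ASCII character. It assumes the server sends the raw UTF-8 bytes for "ä" in the header; the sample string is taken from the street name in the URL and is only illustrative.

```python
# Bytes a server might send for a path segment containing "ä", encoded as UTF-8.
raw = 'Kärntnerstrasse'.encode('utf-8')

# Decoding those bytes as iso-8859-1 (as described for http.client above)
# turns the two UTF-8 bytes of "ä" (0xC3 0xA4) into two separate characters.
as_latin1 = raw.decode('iso-8859-1')

# Decoding them as UTF-8 gives back the intended text.
as_utf8 = raw.decode('utf-8')

print(repr(as_latin1))
print(repr(as_utf8))
```

The iso-8859-1 result is one character longer than the intended string, so a URL built from it no longer matches the one the server meant.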
In parse_headers() in client.py I see the call to decode('iso-8859-1'). My personal hack is to use whatever charset is defined in the Content-Type HTTP header (utf8) and fall back to iso-8859-1 otherwise.

At this point I am not sure where or how a fix should occur, so I thought I'd run it by you in case I should file a bug. Note that I don't use http.client directly, but through the python-requests library.

I include some code to reproduce the problem below.

Cheers,
Hugo

-----

#!/usr/bin/env python3
# Trying to replicate what wget does with a 301 redirect:
# wget --server-response www.starbucks.com/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010
import http.client
import urllib.parse

# s2 is the path requested first; s3 is the URL the redirect should point to.
s2 = '/store/158/AT/Karntnerstrasse/K%c3%a4rntnerstrasse-49-Vienna-9-1010'
s3 = 'http://www.starbucks.com/store/158/at/karntnerstrasse/k%C3%A4rntnerstrasse-49-vienna-9-1010'

# First request: expect a 301 whose Location header matches the unquoted s3.
conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s2)
r = conn.getresponse()
print('Location', r.headers.get('Location'))
print('Expected', urllib.parse.unquote(s3))
assert r.status == 301
assert r.headers.get('Location') == urllib.parse.unquote(s3), \
    'decoded as iso-8859-1 instead of utf8'

# Second request: following the percent-encoded redirect target should succeed.
conn = http.client.HTTPConnection('www.starbucks.com')
conn.request('GET', s3)
r = conn.getresponse()
assert r.status == 200
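The fallback idea described above can also be applied on the caller's side, without touching client.py. The following is only a sketch: redecode_header is a hypothetical helper name, not part of http.client or python-requests, and it assumes the header value was produced by decoding the raw bytes as iso-8859-1, which is lossless and can therefore be reversed.

```python
def redecode_header(value, charset='utf-8'):
    """Hypothetical helper: undo an iso-8859-1 decode and retry with `charset`.

    Every byte maps to exactly one iso-8859-1 character, so encoding the
    string back to iso-8859-1 recovers the original bytes. If those bytes
    are not valid in `charset`, the value is returned unchanged, which is
    the iso-8859-1 fallback.
    """
    try:
        return value.encode('iso-8859-1').decode(charset)
    except (UnicodeEncodeError, UnicodeDecodeError, LookupError):
        return value


# A header value as it would look after UTF-8 bytes were decoded as iso-8859-1.
mangled = 'k\u00e4rntnerstrasse'.encode('utf-8').decode('iso-8859-1')
print(redecode_header(mangled))

# A value that really was iso-8859-1 is not valid UTF-8 and is left alone.
print(redecode_header('caf\u00e9'))
```

Trying UTF-8 first and falling back is a design choice made for this sketch; passing the charset from the Content-Type header as the second argument would match the hack described above more closely.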