Bugs item #1712522, was opened at 2007-05-04 06:11
Message generated for change (Comment added) made by varmaa
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.quote throws exception on Unicode URL

Initial Comment:
The code in urllib.quote fails on Unicode input, when
called by robotparser with a Unicode URL.

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'

That bit of code needs some attention.
- It still assumes ASCII goes up to 255, which hasn't been true in Python for a while now.
- The initialization may not be thread-safe; a table is being initialized on first use.

"robotparser" was trying to check if a URL with a Unicode character in it was allowed. Note the "KeyError: u'\xe2'"

----------------------------------------------------------------------

Comment By: Atul Varma (varmaa)
Date: 2007-06-13 15:36

Message:
Logged In: YES
user_id=863202
Originator: NO

It should be noted that the unicode aspect of this bug is actually a recognized flaw with a nontrivial solution.
See this thread from the Python-dev list, dated from July 2006:

http://mail.python.org/pipermail/python-dev/2006-July/067248.html

It was essentially agreed upon in this thread that the "obvious" solution--simply converting to UTF-8 as per RFC 3986--doesn't actually cover all cases, and that passing a unicode string in to urllib.quote() indeed has ambiguous results. For more information, see Mike Brown's comment on the aforementioned thread:

http://mail.python.org/pipermail/python-dev/2006-July/067335.html

It was generally agreed in the thread that the proper solution was to have urllib.quote() *only* deal with standard Python string data, and to raise a TypeError if a unicode string is passed in, implying that any conversion needs to be done by higher-level code, because implicit conversion within urllib.quote() is too ambiguous.

However, it seems the TypeError fix was never made to the Python SVN repository; perhaps this is because it may have broken legacy code that actually catches KeyErrors as John Nagle mentioned? Or perhaps it was simply because no one ever got around to it. Unfortunately, I'm not in a position to say for sure, but I hope my explanation helps.

----------------------------------------------------------------------

Comment By: John Nagle (nagle)
Date: 2007-06-06 16:49

Message:
Logged In: YES
user_id=5571
Originator: YES

As a workaround, you can surround calls to "can_fetch" with a try block and catch KeyError exceptions. That's what I'm doing.

----------------------------------------------------------------------

Comment By: Collin Winter (collinwinter)
Date: 2007-06-05 23:39

Message:
Logged In: YES
user_id=1344176
Originator: NO

Could you possibly provide a patch to fix this?

----------------------------------------------------------------------
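
The two caller-side options raised in the thread can be sketched roughly as follows: guarding can_fetch against the KeyError (the workaround John Nagle describes) and encoding the unicode string in higher-level code before it reaches quote(). This is only an illustration. The helper names quote_utf8 and can_fetch_or_default are hypothetical and not part of urllib or robotparser, the choice of UTF-8 is an assumption that the python-dev thread says does not cover all cases, and returning False when the check fails is likewise an assumed policy.

```python
# Sketch only: the helper names below are hypothetical, not stdlib APIs.
try:
    from urllib import quote          # Python 2, as in the report above
except ImportError:
    from urllib.parse import quote    # Python 3


def quote_utf8(url_path):
    """Percent-encode a URL path, encoding unicode text to UTF-8 first.

    Assumption: the caller has decided that UTF-8 is the right encoding.
    The python-dev thread points out that this is not true for every URL.
    """
    if isinstance(url_path, type(u"")):
        # Convert in the calling code so quote() only ever sees byte data.
        url_path = url_path.encode("utf-8")
    return quote(url_path)


def can_fetch_or_default(parser, useragent, url, default=False):
    """Call parser.can_fetch(), returning `default` if it raises KeyError.

    `parser` is any object with a can_fetch(useragent, url) method, such
    as a robotparser.RobotFileParser instance.
    """
    try:
        return parser.can_fetch(useragent, url)
    except KeyError:
        # Assumed policy: treat an uncheckable URL as disallowed.
        return default
```

Passing default=True instead would treat an uncheckable URL as allowed; which of the two is appropriate depends on how conservative the crawler needs to be.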