[ python-Bugs-1712522 ] urllib.quote throws exception on Unicode URL

SourceForge.net Wed, 13 Jun 2007 08:36:45 -0700

Bugs item #1712522, was opened at 2007-05-04 06:11
Message generated for change (Comment added) made by varmaa
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.quote throws exception on Unicode URL

Initial Comment:
The code in urllib.quote fails on Unicode input, when
called by robotparser with a Unicode URL.

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'

That bit of code needs some attention.
- It still assumes character values stop at 255, which hasn't been true in
  Python for a while now.
- The initialization may not be thread-safe; a table is being initialized on
  first use.

"robotparser" was trying to check whether a URL with a Unicode character in
it was allowed.  Note the "KeyError: u'\xe2'".

----------------------------------------------------------------------

Comment By: Atul Varma (varmaa)
Date: 2007-06-13 15:36

Message:
Logged In: YES 
user_id=863202
Originator: NO

It should be noted that the unicode aspect of this bug is actually a
recognized flaw with a nontrivial solution.  See this thread from the
Python-dev list, dated from July 2006:

http://mail.python.org/pipermail/python-dev/2006-July/067248.html

It was essentially agreed upon in this thread that the "obvious"
solution--simply converting to UTF-8 as per RFC 3986--doesn't actually cover
all cases, and that passing a unicode string in to urllib.quote() indeed
has ambiguous results.  For more information, see Mike Brown's comment on
the aforementioned thread:

http://mail.python.org/pipermail/python-dev/2006-July/067335.html

It was generally agreed in the thread that the proper solution was to have
urllib.quote() *only* deal with standard Python string data, and to raise a
TypeError if a unicode string is passed in, implying that any conversion
needs to be done by higher-level code, because implicit conversion within
urllib.quote() is too ambiguous.

However, it seems the TypeError fix was never made to the Python SVN
repository; perhaps this is because it may have broken legacy code that
actually catches KeyErrors as John Nagle mentioned?  Or perhaps it was
simply because no one ever got around to it.  Unfortunately, I'm not in a
position to say for sure, but I hope my explanation helps.
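
A minimal sketch of the approach described in this comment, where the
quoting function refuses text and the caller picks the encoding. It is
written for Python 3, where the function lives at urllib.parse.quote and
byte strings are the bytes type; the wrapper name strict_quote is
hypothetical, and UTF-8 is only one possible choice of encoding.

```python
from urllib.parse import quote


def strict_quote(s, safe="/"):
    """Quote byte strings only, so callers must choose an encoding first."""
    if not isinstance(s, bytes):
        raise TypeError("strict_quote() expects bytes, got %s"
                        % type(s).__name__)
    return quote(s, safe=safe)


# Higher-level code decides how to encode the text (UTF-8 assumed here).
path = u"/caf\u00e9"
quoted = strict_quote(path.encode("utf-8"))
print(quoted)
```

Passing the unencoded text to strict_quote raises TypeError immediately
instead of failing deep inside the lookup.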


----------------------------------------------------------------------

Comment By: John Nagle (nagle)
Date: 2007-06-06 16:49

Message:
Logged In: YES 
user_id=5571
Originator: YES

As a workaround, you can surround calls to "can_fetch" with a try block
and catch KeyError exceptions.  That's what I'm doing.
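
The workaround might look roughly like the sketch below. The helper name
can_fetch_or_default is hypothetical, and so is the decision to treat a
KeyError as "not allowed"; the comment above does not say what is done once
the exception is caught. The parser argument is any object with a
can_fetch(useragent, url) method, such as a robotparser instance.

```python
def can_fetch_or_default(parser, useragent, url, default=False):
    """Call parser.can_fetch(), falling back to default on KeyError."""
    try:
        return parser.can_fetch(useragent, url)
    except KeyError:
        # Raised when quoting the URL path fails on a non-ASCII character.
        # Assumption: fall back to a fixed answer rather than aborting.
        return default
```

Defaulting to False is the conservative choice for a crawler, since it
skips pages whose robots.txt status could not be determined.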

----------------------------------------------------------------------

Comment By: Collin Winter (collinwinter)
Date: 2007-06-05 23:39

Message:
Logged In: YES 
user_id=1344176
Originator: NO

Could you possibly provide a patch to fix this?

----------------------------------------------------------------------
