Bugs item #1457264, was opened at 2006-03-23 20:49
Message generated for change (Comment added) made by gbrandl
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1457264&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.3
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Steven Willis (onlynone)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.splithost parses incorrectly

Initial Comment:
urllib.splithost(url) requires that the url passed in be of the form
'//host[:port]/path'. Yet I've run across some urls that are of the form
'//host[:port]?querystring'. This causes splithost to return everything as
the host and nothing as the path.

Section 3.2 of rfc2396 (Uniform Resource Identifiers: Generic Syntax)
states that 'The authority component is preceded by a double slash "//"
and is terminated by the next slash "/", question-mark "?", or by the end
of the URI.'

Also, this is how it defines a URI:

    absoluteURI = scheme ":" ( hier_part | opaque_part )
    hier_part   = ( net_path | abs_path ) [ "?" query ]
    net_path    = "//" authority [ abs_path ]
    abs_path    = "/" path_segments

Based on the above, you could certainly have: 'http://authority?query' as
a valid url.

In python2.3 you would just need to change line 939 in urllib.py from:

    _hostprog = re.compile('^//([^/]*)(.*)$')

to:

    _hostprog = re.compile('^//([^/?]*)(.*)$')

This appears to affect all python versions, I just happened to be using 2.3.

----------------------------------------------------------------------

>Comment By: Georg Brandl (gbrandl)
Date: 2006-03-26 21:00

Message:
Logged In: YES
user_id=849994

Fixed in rev. 43330.

----------------------------------------------------------------------

Comment By: Steven Willis (onlynone)
Date: 2006-03-24 17:12

Message:
Logged In: YES
user_id=1299996

The problem I was having specifically was that the url had a colon in the
query string.
Since the query string was being parsed as part of the host, the text
after the colon was being treated as the port when urllib.splitport was
called later. The following is a simple testcase:

    import urllib2
    webpage = urllib2.urlopen("http://host.com?a=b:3b")

You will then get a "httplib.InvalidURL: nonnumeric port: '3b'"
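
The parsing difference described in the report can be reproduced without urllib itself. The sketch below is written for Python 3 and uses only the two regular expressions quoted in the initial comment. The helpers splithost and port_text are hypothetical stand-ins written for this illustration, not the library's own functions, and port_text only approximates what urllib.splitport does with the text after the last colon.

```python
import re

# Pattern quoted in the report as the original (urllib.py, Python 2.3).
OLD_HOSTPROG = re.compile('^//([^/]*)(.*)$')
# Pattern proposed in the report: the host also ends at a '?'.
NEW_HOSTPROG = re.compile('^//([^/?]*)(.*)$')


def splithost(url, hostprog):
    """Hypothetical stand-in for urllib.splithost: return (host, rest)."""
    match = hostprog.match(url)
    if match:
        return match.group(1), match.group(2)
    return None, url


def port_text(host):
    """Hypothetical helper: text after the last ':' in host, or None."""
    _, sep, port = host.rpartition(':')
    return port if sep else None


# The scheme-less part of the URL from the testcase in the thread.
url = '//host.com?a=b:3b'

old_host, old_path = splithost(url, OLD_HOSTPROG)
new_host, new_path = splithost(url, NEW_HOSTPROG)

# Old pattern: the query string is swallowed into the host, so '3b'
# is later mistaken for a port.
print(repr(old_host), repr(old_path), repr(port_text(old_host)))
# New pattern: the host stops at '?', and no port text is found.
print(repr(new_host), repr(new_path), repr(port_text(new_host)))
```

With the old pattern the whole string after '//' ends up as the host, which matches the "nonnumeric port: '3b'" error reported above. With the proposed pattern the host is 'host.com' and the query string is left in the remainder.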