New submission from Brian Slesinsky <br...@slesinsky.org>:

If a robots.txt file contains a rule of the form:

    Disallow: /some/path?name=value

This pattern will never match a URL passed to can_fetch(), as far as I
can tell.

It's arguable whether this is a bug. The 1994 robots.txt protocol is
silent on whether to treat query strings specially and just says "any
URL that starts with this value will not be retrieved". The 1997 draft
standard talks about the path portion of a URL but doesn't give any
examples about how to treat the '?' character in a robots.txt pattern.

Google extends the protocol to allow wildcard characters in a way that
doesn't treat the '?' character specially. See:
http://www.google.com/support/webmasters/bin/answer.py?answer=40360&cbid=-1rdq1gi8f11xx&src=cb&lev=answer#3

I'll leave aside whether to implement pattern matching, but it seems
like a good idea to do something reasonable when a robots.txt pattern
contains a literal '?', and treating it as a literal character seems
simplest.

Cause: in robotparser.can_fetch(), there is this code which seems to
take only the path (stripping the query string).

    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"

Also, when parsing patterns in the robots.txt file, a '?' character
seems to be automatically URL-escaped. There's nothing in a standards
doc about doing this so I think that might be a bug too.

Tested with Python 2.4. I looked at the code in Subversion head and it
doesn't look like there were any changes on the trunk.

----------
components: Library (Lib)
messages: 89622
nosy: skybrian
severity: normal
status: open
title: robotparser doesn't handle URL's with query strings
type: behavior
versions: Python 2.4, Python 2.5, Python 2.6, Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6325>
_______________________________________
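
For illustration, here is a minimal sketch of the two effects described in the report. It uses Python 3's urllib.parse rather than the Python 2 urllib/urlparse modules the report was tested against, and it does not call robotparser itself. The helper names path_only and literal_prefix_match are hypothetical, example.com is a placeholder host, and the "literal" matcher is only one possible reading of "treat '?' as a literal character", not a proposed patch.

```python
from urllib.parse import quote, unquote, urlparse


def path_only(url):
    """Mirror the normalisation quoted in the report: keep only the path."""
    return quote(urlparse(unquote(url))[2]) or "/"


def literal_prefix_match(rule, url):
    """Hypothetical alternative: match the rule against path plus query."""
    parts = urlparse(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target.startswith(rule)


rule = "/some/path?name=value"
url = "http://example.com/some/path?name=value&other=1"  # placeholder URL

# The query string is dropped, so a rule containing '?' cannot be a prefix.
normalised = path_only(url)
stripped_match = normalised.startswith(rule)

# If the rule itself is passed through quote(), '?' and '=' are escaped.
escaped_rule = quote(rule)

# Plain prefix matching on path + query does find the rule.
literal_match = literal_prefix_match(rule, url)

print(normalised, stripped_match, escaped_rule, literal_match)
```

With the path-only normalisation the rule can never be a prefix of the tested URL, while prefix matching on path plus query string behaves as the 1994 wording ("any URL that starts with this value") would suggest.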