In article <[EMAIL PROTECTED]>, John Nagle <[EMAIL PROTECTED]> wrote:
> Python's "robots.txt" file parser may be misinterpreting a > special case. Given a robots.txt file like this: > > User-agent: * > Disallow: // > Disallow: /account/registration > Disallow: /account/mypro > Disallow: /account/myint > ... > > the python library "robotparser.RobotFileParser()" considers all pages of the > site to be disallowed. Apparently "Disallow: //" is being interpreted as > "Disallow: /". Even the home page of the site is locked out. This may be > incorrect. > > This is the robots.txt file for "http://ibm.com". Hi John, Are you sure you're not confusing your sites? The robots.txt file at www.ibm.com contains the double slashed path. The robots.txt file at ibm.com is different and contains this which would explain why you think all URLs are denied: User-agent: * Disallow: / I don't see the bug to which you're referring: >>> import robotparser >>> r = robotparser.RobotFileParser() >>> r.set_url("http://www.ibm.com/robots.txt") >>> r.read() >>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html") 1 >>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html") 0 >>> I'll use this opportunity to shamelessly plug an alternate robots.txt parser that I wrote to address some small bugs in the parser in the standard library. http://NikitaTheSpider.com/python/rerp/ Cheers -- Philip http://NikitaTheSpider.com/ Whole-site HTML validation, link checking and more -- http://mail.python.org/mailman/listinfo/python-list