In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

>    Python's "robots.txt" file parser may be misinterpreting a
> special case.  Given a robots.txt file like this:
> 
>       User-agent: *
>       Disallow: //
>       Disallow: /account/registration
>       Disallow: /account/mypro
>       Disallow: /account/myint
>       ...
> 
> the Python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed.  Apparently "Disallow: //" is being interpreted as
> "Disallow: /".  Even the home page of the site is locked out. This may be 
> incorrect.
> 
> This is the robots.txt file for "http://ibm.com".

Hi John,
Are you sure you're not confusing your sites? The robots.txt file at 
www.ibm.com contains the double-slashed path. The robots.txt file at 
ibm.com is different and contains this, which would explain why you 
think all URLs are denied:
User-agent: *
Disallow: /
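
For what it's worth, you can confirm that offline by feeding the rules 
straight to the parser instead of fetching them (just a sketch; the bot 
name and URLs are placeholders, and whether it prints 0/1 or 
False/True depends on your Python version):

import robotparser

# hypothetical rule set matching what ibm.com serves: deny everything
rules = ["User-agent: *", "Disallow: /"]

parser = robotparser.RobotFileParser()
parser.parse(rules)   # parse a list of lines rather than calling read()

# with "Disallow: /" every path is denied, including the home page
print parser.can_fetch("WhateverBot", "http://ibm.com/")
print parser.can_fetch("WhateverBot", "http://ibm.com/account/mypro")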

I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>> 
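
The same offline check works against the double-slash rules you quoted 
(abbreviated here to the lines from your post; again only a sketch, with 
the same 0/1 vs False/True caveat):

import robotparser

rules = [
    "User-agent: *",
    "Disallow: //",
    "Disallow: /account/registration",
    "Disallow: /account/mypro",
    "Disallow: /account/myint",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# "Disallow: //" only matches paths that begin with a double slash,
# so the home page and ordinary single-slash URLs remain fetchable
print parser.can_fetch("WhateverBot", "http://www.ibm.com/")           # allowed
print parser.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")   # allowed
print parser.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")  # denied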

I'll use this opportunity to shamelessly plug an alternate robots.txt 
parser that I wrote to address some small bugs in the parser in the 
standard library. 
http://NikitaTheSpider.com/python/rerp/

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more