Let's say you are running a robot over a site http://www.megacorp.com
that has this robots.txt file:
> User-Agent: *
> Disallow: http://www.gigacorp.com
> Disallow: /cgi-bin/

WWW::RobotRules interprets this as forbidding all access to
http://www.megacorp.com.

In WWW::RobotRules::parse, the value of the Disallow field is
resolved against the site's URL and reduced to its path and query
with this line:
  $disallow = URI->new($disallow, $url)->path_query;
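
For a normal, path-only Disallow value this does the right thing.  A
quick standalone check (my own sketch, not the module's code):

  use URI;
  # A relative value such as "/cgi-bin/" is resolved against the
  # robots.txt URL and comes back as a plain path, as intended.
  my $disallow = URI->new("/cgi-bin/", "http://www.megacorp.com/robots.txt")->path_query;
  print "$disallow\n";   # prints "/cgi-bin/"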

However, URI->new("http://www.gigacorp.com",
"http://www.megacorp.com") returns "http://www.gigacorp.com", whose
path component is "/".  So $disallow above is assigned "/".  In other
words, the entire hierarchy is forbidden, which is certainly not
right.
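
To see why "/" shuts out the whole site, note that the disallow rules
are applied as simple prefix tests against the path of each requested
URL.  A simplified sketch of that kind of test (not the module's
actual allowed() code):

  # "/" is a prefix of every path, so every request is refused.
  my @disallow = ("/");
  my $path = "/products/index.html";
  foreach my $rule (@disallow) {
      print "forbidden\n" if index($path, $rule) == 0;
  }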

I've patched my copy of RobotRules.pm by adding:
  next if $disallow =~ /^http/;

I believe this conforms to the robots.txt specification and is a good
idea.
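
A stricter variant of the same idea, if anyone prefers it, would be
to skip any Disallow value that parses with its own scheme, rather
than matching on the literal prefix "http" (an untested sketch, not
the patch I applied):

  use URI;
  # An absolute URL ("http://www.gigacorp.com") has a defined scheme;
  # a bare robots.txt path ("/cgi-bin/") does not.
  for my $disallow ("http://www.gigacorp.com", "/cgi-bin/") {
      next if defined URI->new($disallow)->scheme;
      print "kept rule: $disallow\n";   # only "/cgi-bin/" survives
  }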

Tony
