[EMAIL PROTECTED] writes:
> Let's say you are running a robot over a site http://www.megacorp.com
> that has this robots.txt file:
> > User-Agent: *
> > Disallow: http://www.gigacorp.com
> > Disallow: /cgi-bin/
>
> WWW::RobotRules interprets this as forbidding all access to
> http://www.megacorp.com.
I agree that this is wrong.
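
For the record, this is easy to reproduce from the public interface.
A minimal sketch (new/parse/allowed are the documented WWW::RobotRules
API; the robot name is made up):

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MyRobot/1.0');
    my $robots_txt =
        "User-Agent: *\n" .
        "Disallow: http://www.gigacorp.com\n" .
        "Disallow: /cgi-bin/\n";
    $rules->parse('http://www.megacorp.com/robots.txt', $robots_txt);

    # Only /cgi-bin/ should be off limits on this host, but with the
    # buggy parse this prints "blocked" for every page on the site.
    print $rules->allowed('http://www.megacorp.com/index.html')
        ? "allowed\n" : "blocked\n";
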
> In WWW::RobotRules::parse, the relative URL from the Disallow
> field is converted to an absolute URL with this line:
> $disallow = URI->new($disallow, $url)->path_query;
>
> However, URI->new("http://www.gigacorp.com",
> "http://www.megacorp.com") returns "http://www.gigacorp.com", the path
> component of which is "/". So, $disallow above gets assigned "/".
> In other words, the entire hierarchy is forbidden, which is certainly
> not right.
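
With a current URI.pm the path actually comes back empty rather than
"/", but the effect is the same: an empty prefix matches every path on
the site. A minimal sketch of the resolution:

    use URI;

    # The second argument only supplies a default scheme; since the
    # first argument is already absolute, the megacorp base is ignored.
    my $u = URI->new("http://www.gigacorp.com", "http://www.megacorp.com");
    print $u, "\n";              # prints "http://www.gigacorp.com"
    print $u->path_query, "\n";  # prints the empty string
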
>
> I've patched my copy of RobotRules.pm by adding:
> next if $disallow =~ /^http/;
>
> I believe this conforms to the robots.txt description and is a good
> idea.
I don't think this will do the right thing if the same robots.txt was
also returned for http://www.gigacorp.com/robots.txt. In that case we
want to honor the Disallow line and block all traversal of that site.
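
As a sketch of that failure mode (same hypothetical robot name as
before; parse/allowed are the documented API):

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MyRobot/1.0');
    $rules->parse('http://www.gigacorp.com/robots.txt',
                  "User-Agent: *\nDisallow: http://www.gigacorp.com\n");

    # The Disallow line names this very site, so everything on it
    # should be blocked.  With the "next if $disallow =~ /^http/"
    # patch the line is dropped instead, and this prints "allowed".
    print $rules->allowed('http://www.gigacorp.com/anything.html')
        ? "allowed\n" : "blocked\n";
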
The next LWP release will address this. My current handling of
Disallow lines looks like this:
elsif (/^Disallow:\s*(.*)/i) {
    unless (defined $ua) {
        warn "RobotRules: Disallow without preceding User-agent\n";
        $is_anon = 1;   # assume that User-agent: * was intended
    }
    my $disallow = $1;
    $disallow =~ s/\s+$//;   # strip trailing whitespace
    if (length $disallow) {
        my $ignore;
        eval {
            # Resolve the field relative to the robots.txt URL and
            # ignore the rule unless it refers back to the same site.
            my $u = URI->new_abs($disallow, $robot_txt_uri);
            $ignore++ if $u->scheme ne $robot_txt_uri->scheme;
            $ignore++ if lc($u->host) ne lc($robot_txt_uri->host);
            $ignore++ if $u->port ne $robot_txt_uri->port;
            $disallow = $u->path_query;
            $disallow = "/" unless length $disallow;
        };
        next if $@;        # could not parse the URL; skip the line
        next if $ignore;   # different scheme, host or port; skip the line
    }
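
To see what the check does in isolation, here is a standalone sketch
of the same comparisons outside of parse():

    use URI;

    my $robot_txt_uri = URI->new("http://www.megacorp.com/robots.txt");
    for my $d ("http://www.gigacorp.com",
               "http://www.megacorp.com/private",
               "/cgi-bin/") {
        my $u = URI->new_abs($d, $robot_txt_uri);
        my $foreign = $u->scheme ne $robot_txt_uri->scheme
                   || lc($u->host) ne lc($robot_txt_uri->host)
                   || $u->port ne $robot_txt_uri->port;
        my $path = $u->path_query;
        $path = "/" unless length $path;
        print "$d => ", $foreign ? "ignored (other site)\n"
                                 : "Disallow $path\n";
    }

Absolute Disallow URLs that point back at the same scheme, host and
port are kept (reduced to their path), and everything else is quietly
dropped.
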
Regards,
Gisle