[EMAIL PROTECTED] writes:
> Let's say you are running a robot over a site http://www.megacorp.com
> that has this robots.txt file:
> > User-Agent: *
> > Disallow: http://www.gigacorp.com
> > Disallow: /cgi-bin/
>
> WWW::RobotRules interprets this as forbidding all access to
> http://www.megacorp.com.
I agree that this is wrong.
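
For the record, this is easy to reproduce from the public interface.
A minimal sketch (new/parse/allowed are the documented WWW::RobotRules
API; the robot name is made up):

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MyRobot/1.0');
    my $robots_txt =
        "User-Agent: *\n" .
        "Disallow: http://www.gigacorp.com\n" .
        "Disallow: /cgi-bin/\n";
    $rules->parse('http://www.megacorp.com/robots.txt', $robots_txt);

    # Only /cgi-bin/ should be off limits on this host, but with the
    # buggy parse this prints "blocked" for every page on the site.
    print $rules->allowed('http://www.megacorp.com/index.html')
        ? "allowed\n" : "blocked\n";
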
> In WWW::RobotRules::parse, the relative URL from the Disallow
> field is converted to an absolute URL with this line:
> $disallow = URI->new($disallow, $url)->path_query;
>
> However, URI->new("http://www.gigacorp.com",
> "http://www.megacorp.com") returns "http://www.gigacorp.com", the path
> component of which is "/". So, $disallow above gets assigned "/".
> In other words, the entire hierarchy is forbidden, which is certainly
> not right.
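
With a current URI.pm the path actually comes back empty rather than
"/", but the effect is the same: an empty prefix matches every path on
the site. A minimal sketch of the resolution:

    use URI;

    # The second argument only supplies a default scheme; since the
    # first argument is already absolute, the megacorp base is ignored.
    my $u = URI->new("http://www.gigacorp.com", "http://www.megacorp.com");
    print $u, "\n";              # prints "http://www.gigacorp.com"
    print $u->path_query, "\n";  # prints the empty string
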
>
> I've patched my copy of RobotRules.pm by adding:
> next if $disallow =~ /^http/;
>
> I believe this conforms to the robots.txt description and is a good
> idea.
I don't think this will do the right thing if the same robots.txt was
also returned for http://www.gigacorp.com/robots.txt. In that case we
want to honor the Disallow line and block all traversal of that site.
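
As a sketch of that failure mode (same hypothetical robot name as
before; parse/allowed are the documented API):

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MyRobot/1.0');
    $rules->parse('http://www.gigacorp.com/robots.txt',
                  "User-Agent: *\nDisallow: http://www.gigacorp.com\n");

    # The Disallow line names this very site, so everything on it
    # should be blocked.  With the "next if $disallow =~ /^http/"
    # patch the line is dropped instead, and this prints "allowed".
    print $rules->allowed('http://www.gigacorp.com/anything.html')
        ? "allowed\n" : "blocked\n";
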
The next LWP release will address this. My current handling of
Disallow lines looks like this:
elsif (/^Disallow:\s*(.*)/i) {
    unless (defined $ua) {
        warn "RobotRules: Disallow without preceding User-agent\n";
        $is_anon = 1;   # assume that User-agent: * was intended
    }
    my $disallow = $1;
    $disallow =~ s/\s+$//;   # strip trailing whitespace
    if (length $disallow) {
        my $ignore;
        eval {
            # Resolve the field relative to the robots.txt URL and
            # ignore the rule unless it refers back to the same site.
            my $u = URI->new_abs($disallow, $robot_txt_uri);
            $ignore++ if $u->scheme ne $robot_txt_uri->scheme;
            $ignore++ if lc($u->host) ne lc($robot_txt_uri->host);
            $ignore++ if $u->port ne $robot_txt_uri->port;
            $disallow = $u->path_query;
            $disallow = "/" unless length $disallow;
        };
        next if $@;        # could not parse the URL; skip the line
        next if $ignore;   # different scheme, host or port; skip the line
    }
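
To see what the check does in isolation, here is a standalone sketch
of the same comparisons outside of parse():

    use URI;

    my $robot_txt_uri = URI->new("http://www.megacorp.com/robots.txt");
    for my $d ("http://www.gigacorp.com",
               "http://www.megacorp.com/private",
               "/cgi-bin/") {
        my $u = URI->new_abs($d, $robot_txt_uri);
        my $foreign = $u->scheme ne $robot_txt_uri->scheme
                   || lc($u->host) ne lc($robot_txt_uri->host)
                   || $u->port ne $robot_txt_uri->port;
        my $path = $u->path_query;
        $path = "/" unless length $path;
        print "$d => ", $foreign ? "ignored (other site)\n"
                                 : "Disallow $path\n";
    }

Absolute Disallow URLs that point back at the same scheme, host and
port are kept (reduced to their path), and everything else is quietly
dropped.
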
Regards,
Gisle