Craig Macdonald <[EMAIL PROTECTED]> writes:

> Hi, just a short note to suggest a 1-line change to WWW::RobotRules.
> 
> When loading http://www.maths.gla.ac.uk/robots.txt, I noticed
> WWW::RobotRules giving me warnings:
> 
> RobotRules: Unexpected line:      User-agent: *
> RobotRules: Unexpected line:      Disallow: /error/
> RobotRules: Unexpected line:      Disallow: /tla_review/
> 
> etc.
> 
> The problem is that WWW::RobotRules doesn't support leading space on a
> robots.txt line. As such, I would suggest adding
> s/^\s*//;
> at line 51 of RobotRules.pm.
> 
> I'm not sure how frequent a problem this might be, but it seems
> important to make WWW::RobotRules as robust at parsing robots.txt files
> as possible, in order to prevent parts of sites being crawled that
> shouldn't be.

The spec at <http://www.robotstxt.org/wc/norobots.html> states that
leading space is not allowed, but I agree that LWP should be a bit
more liberal when parsing.  I've now applied the following patch.

Regards,
Gisle


Index: lib/LWP/RobotUA.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.24
diff -u -p -r1.24 RobotUA.pm
--- lib/LWP/RobotUA.pm  6 Apr 2004 11:02:50 -0000       1.24
+++ lib/LWP/RobotUA.pm  6 Apr 2004 11:36:10 -0000
@@ -126,7 +126,7 @@ sub simple_request
        my $fresh_until = $robot_res->fresh_until;
        if ($robot_res->is_success) {
            my $c = $robot_res->content;
-           if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
+           if ($robot_res->content_type =~ m,^text/, && $c =~ /^\s*Disallow\s*:/mi) {
                LWP::Debug::debug("Parsing robot rules");
                $self->{'rules'}->parse($robot_url, $c, $fresh_until);
            }
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.28
diff -u -p -r1.28 RobotRules.pm
--- lib/WWW/RobotRules.pm       6 Apr 2004 11:10:49 -0000       1.28
+++ lib/WWW/RobotRules.pm       6 Apr 2004 11:36:11 -0000
@@ -54,7 +54,7 @@ sub parse {
            last if $is_me; # That was our record. No need to read the rest.
            $is_anon = 0;
        }
-        elsif (/^User-Agent:\s*(.*)/i) {
+        elsif (/^\s*User-Agent\s*:\s*(.*)/i) {
            $ua = $1;
            $ua =~ s/\s+$//;
            if ($is_me) {
@@ -68,7 +68,7 @@ sub parse {
                $is_me = 1;
            }
        }
-       elsif (/^Disallow\s*:\s*(.*)/i) {
+       elsif (/^\s*Disallow\s*:\s*(.*)/i) {
            unless (defined $ua) {
                warn "RobotRules: Disallow without preceding User-agent\n";
                $is_anon = 1;  # assume that User-agent: * was intended
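
For illustration only, here is a minimal standalone sketch of the effect of the patch. The helpers classify_line and classify_line_strict are hypothetical and are not part of WWW::RobotRules; they only reuse the old and new regexes from the diff above to show how an indented line is classified before and after the change.

```perl
use strict;
use warnings;

# Hypothetical helper (not in WWW::RobotRules): classify one robots.txt line
# with the tolerant patterns from the patch, which allow leading whitespace.
sub classify_line {
    my ($line) = @_;
    return [ 'user-agent', $1 ] if $line =~ /^\s*User-Agent\s*:\s*(.*)/i;
    return [ 'disallow',   $1 ] if $line =~ /^\s*Disallow\s*:\s*(.*)/i;
    return [ 'unexpected', $line ];
}

# Hypothetical helper: the same classification with the pre-patch patterns,
# which require the field name to start in the first column.
sub classify_line_strict {
    my ($line) = @_;
    return [ 'user-agent', $1 ] if $line =~ /^User-Agent:\s*(.*)/i;
    return [ 'disallow',   $1 ] if $line =~ /^Disallow\s*:\s*(.*)/i;
    return [ 'unexpected', $line ];
}

# Lines shaped like the ones in the report: indented field names.
for my $line ("     User-agent: *", "     Disallow: /error/") {
    my $old = classify_line_strict($line)->[0];
    my $new = classify_line($line)->[0];
    print "old=$old new=$new line=[$line]\n";
}
```

With the strict patterns the indented lines fall through to the "unexpected" case, which matches the warnings in the original report; with the tolerant patterns they are recognized as User-agent and Disallow fields.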
