Craig Macdonald <[EMAIL PROTECTED]> writes:

> Hi, just a short note to suggest a 1-line change to WWW::RobotRules.
>
> When loading, http://www.maths.gla.ac.uk/robots.txt I noticed
> WWW::RobotRules giving me warnings:
>
> RobotRules: Unexpected line: User-agent: *
> RobotRules: Unexpected line: Disallow: /error/
> RobotRules: Unexpected line: Disallow: /tla_review/
>
> etc.
>
> The problem is that WWW::RobotRules doesn't support leading space on a
> robots.txt line. As such, I would suggest adding
> s/^\s*//;
> at line 51 of RobotRules.pm.
>
> I'm not sure how frequent a problem this might be, but it seems
> important to make WWW::RobotRules as robust at parsing robots.txt files
> as possible, in order to prevent parts of sites being crawled that
> shouldn't be.

The spec at <http://www.robotstxt.org/wc/norobots.html> states that
leading space is not allowed, but I agree that LWP should be a bit more
liberal when parsing. I've now applied the following patch.

Regards,
Gisle

Index: lib/LWP/RobotUA.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/LWP/RobotUA.pm,v
retrieving revision 1.24
diff -u -p -r1.24 RobotUA.pm
--- lib/LWP/RobotUA.pm  6 Apr 2004 11:02:50 -0000       1.24
+++ lib/LWP/RobotUA.pm  6 Apr 2004 11:36:10 -0000
@@ -126,7 +126,7 @@ sub simple_request
         my $fresh_until = $robot_res->fresh_until;
         if ($robot_res->is_success) {
             my $c = $robot_res->content;
-            if ($robot_res->content_type =~ m,^text/, && $c =~ /^Disallow\s*:/mi) {
+            if ($robot_res->content_type =~ m,^text/, && $c =~ /^\s*Disallow\s*:/mi) {
                 LWP::Debug::debug("Parsing robot rules");
                 $self->{'rules'}->parse($robot_url, $c, $fresh_until);
             }
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.28
diff -u -p -r1.28 RobotRules.pm
--- lib/WWW/RobotRules.pm       6 Apr 2004 11:10:49 -0000       1.28
+++ lib/WWW/RobotRules.pm       6 Apr 2004 11:36:11 -0000
@@ -54,7 +54,7 @@ sub parse {
             last if $is_me; # That was our record. No need to read the rest.
             $is_anon = 0;
         }
-        elsif (/^User-Agent:\s*(.*)/i) {
+        elsif (/^\s*User-Agent\s*:\s*(.*)/i) {
             $ua = $1;
             $ua =~ s/\s+$//;
             if ($is_me) {
@@ -68,7 +68,7 @@ sub parse {
                 $is_me = 1;
             }
         }
-        elsif (/^Disallow\s*:\s*(.*)/i) {
+        elsif (/^\s*Disallow\s*:\s*(.*)/i) {
             unless (defined $ua) {
                 warn "RobotRules: Disallow without preceding User-agent\n";
                 $is_anon = 1;  # assume that User-agent: * was intended
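
For anyone who wants to see the effect of the relaxed patterns without patching their install, here is a minimal standalone sketch. The helper name classify_line is hypothetical and is not part of WWW::RobotRules; it only reuses the two regular expressions from the patch above, applied to sample lines like the ones in the report.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only (not part of WWW::RobotRules).
# It applies the relaxed patterns from the patch, which tolerate leading
# whitespace and whitespace before the colon.
sub classify_line {
    my ($line) = @_;
    return ('user-agent', $1) if $line =~ /^\s*User-Agent\s*:\s*(.*)/i;
    return ('disallow',   $1) if $line =~ /^\s*Disallow\s*:\s*(.*)/i;
    return ('other', $line);
}

# Sample lines: one flush left, two indented as in the reported file.
for my $line ("User-agent: *", "  Disallow: /error/", "\tDisallow: /tla_review/") {
    my ($kind, $value) = classify_line($line);
    print "$kind => $value\n";
}
```

With the old patterns (no leading \s*), the two indented lines would fall through to the "Unexpected line" branch, which is what produced the warnings.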