The problem: if I include a space in my robot's user-agent name, WWW::RobotRules
fails to recognize robots.txt records targeted at my robot.
My robot's user agent:
Hispanic Business Inc. Spider/1.0
Robots.txt file:
User-agent: Hispanic Business Inc. Spider
Disallow:
User-agent: *
Disallow: /
My robot will incorrectly refuse to spider anything, because
WWW::RobotRules::agent shortens $self->{'ua'} to "Hispanic".
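For illustration, here is a hypothetical standalone sketch of the shortening
logic as it stands in 5.803 (the two substitutions are copied from the module;
the surrounding script is mine, not the module itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reproduce the name-shortening from WWW::RobotRules::agent (5.803).
my $name = "Hispanic Business Inc. Spider/1.0";

$name = $1 if $name =~ m/(\S+)/;   # get first word
$name =~ s!/.*!!;                  # get rid of version

print "$name\n";   # prints "Hispanic", not "Hispanic Business Inc. Spider"
```

So the "User-agent: Hispanic Business Inc. Spider" record never matches, and the
robot falls through to the "User-agent: *" record, which disallows everything.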
I propose the attached patch to the RobotRules.pm included in libwww-perl 5.803.
--
Matthew.van.Eerde (at) hbinc.com 805.964.4554 x902
Hispanic Business Inc./HireDiversity.com Software Engineer
--- libwww-perl-5.803/lib/WWW/RobotRules.pm.original	2005-10-13 16:26:27.000000000 -0700
+++ libwww-perl-5.803/lib/WWW/RobotRules.pm	2005-10-13 16:27:27.000000000 -0700
@@ -185,8 +185,8 @@
     # "FooBot/1.2" => "FooBot"
     # "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
-    $name = $1 if $name =~ m/(\S+)/; # get first word
     $name =~ s!/.*!!; # get rid of version
+    $name =~ s/\s+$//; # get rid of trailing space
     unless ($old && $old eq $name) {
         delete $self->{'loc'}; # all old info is now stale
         $self->{'ua'} = $name;
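For completeness, a sketch of what the patched logic produces (again a
hypothetical standalone script applying the two substitutions in order, not the
module itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patched shortening: strip the version, then any trailing whitespace.
my $name = "Hispanic Business Inc. Spider/1.0";
$name =~ s!/.*!!;    # get rid of version -> "Hispanic Business Inc. Spider"
$name =~ s/\s+$//;   # get rid of trailing space (no-op here)
print "$name\n";     # prints "Hispanic Business Inc. Spider"

# The FooBot examples from the module's comments still shorten correctly:
my $foo = "FooBot/1.2";
$foo =~ s!/.*!!;
$foo =~ s/\s+$//;
print "$foo\n";      # prints "FooBot"
```

With this, the multi-word name survives intact and matches its own
"User-agent:" record, while single-word names behave exactly as before.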