At 12:47 2002-03-14 +0100, Martin Beet wrote:
>On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
>SMB> I'm a bit perplexed over whether the current Perl library
>SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
>SMB> Standard correctly.  So forgive me if this seems a simple
>SMB> question, but my reading of the Robots Exclusion Standard hasn't
>SMB> really cleared it up in my mind yet.
>[...]
>When you look at the WWW::RobotRules implementation, you will see that
>the actual comparison is done in the is_me() method, and essentially
>looks like this: [...] where $ua is the user agent "name" in the robot
>exclusion file. I.e. it checks to see whether the user agent "name" is
>part of the whole UA identifier. Which is exactly what's required.

Well, the code in full looks like this:

  # is_me()
  #
  # Returns TRUE if the given name matches the
  # name of this robot
  #
  sub is_me {
      my($self, $ua) = @_;
      my $me = $self->agent;
      return index(lc($me), lc($ua)) >= 0;
  }

But notice that it's matching against the /whole/ agent name (like "Foo",
"Foo/1.2", "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)"):
it asks whether the content in "User-Agent: ...content..." occurs anywhere
in that whole string.  (The content is what's passed to
$thing->is_me($content).)

I think that what it /should/ do (given what the various specs say) is this:

  sub is_me {
      my($self, $ua) = @_;
      my $me = $self->agent;
      $me = $1 if $me =~ m<(\S+)>;   # first word
      $me =~ s</\d+(\.\d+)?$><> or $me =~ s</\.\d+$><>;
        # remove version string
      return index(lc($me), lc($ua)) >= 0;
  }

where those regexps extract the "Foo" in all of: "Foo", "Foo/1.2", and
"Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)".

E.g., http://www.robotstxt.org/wc/norobots.html says:

<<User-agent [...] The robot should be liberal in interpreting this field.
A case insensitive substring match of the name without version information
is recommended.>>

...note the "without version information".  Ditto the spec you cited,
which says

<<That is, the User-Agent (HTTP) header consists of one or more words, and
the very first word is taken to be the "name", which is referred to in the
robot exclusion files.>>

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/

--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of
a message to "[EMAIL PROTECTED]".
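
A minimal sketch of the name-extraction step on its own, so it can be checked outside of WWW::RobotRules. The helper name short_name() is hypothetical and is not part of the library; the two substitutions are the ones from the proposed is_me() above, and the three agent strings are the examples from this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of WWW::RobotRules):
# reduce a full User-Agent string to the bare robot name.
sub short_name {
    my ($agent) = @_;
    my $name = $agent;
    $name = $1 if $name =~ m<(\S+)>;   # keep only the first word
    $name =~ s</\d+(\.\d+)?$><>        # drop a trailing "/1" or "/1.2"
        or $name =~ s</\.\d+$><>;      # otherwise drop a trailing "/.2"
    return $name;
}

my @agents = (
    'Foo',
    'Foo/1.2',
    'Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)',
);

# Each of the three example strings should print "Foo".
print short_name($_), "\n" for @agents;
```

A version string in some other shape (say "Foo/1.2.3" or "Foo/1.2b") is not stripped by these two substitutions, so this only covers the forms discussed here.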