At 12:47 2002-03-14 +0100, Martin Beet wrote:
>  On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
>SMB> I'm a bit perplexed over whether the current Perl library
>SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
>SMB> Standard correctly.  So forgive me if this seems a simple
>SMB> question, but my reading of the Robots Exclusion Standard hasn't
>SMB> really cleared it up in my mind yet.
>[...]
>When you look at the WWW::RobotRules implementation, you will see that
>the actual comparison is done in the is_me() method, and essentially
>looks like this: [...] where $ua is the user agent "name" in the robot
>exclusion file. I.e., it checks to see whether the user agent "name" is
>part of the whole UA identifier. Which is exactly what's required.

Well, the code in full looks like this:

# is_me()
#
# Returns TRUE if the given name matches the
# name of this robot
#
sub is_me {
     my($self, $ua) = @_;
     my $me = $self->agent;
     return index(lc($me), lc($ua)) >= 0;
}

But notice that it's asking whether the content in "User-Agent: 
...content..." (the content is what's passed to $thing->is_me($content)) 
is a substring of the /whole/ agent name (like "Foo", "Foo/1.2", or 
"Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)"), version 
string and everything after it included.
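
To make that concrete, here is a minimal sketch of the same comparison as a 
standalone function. The name is_me_current and the sample robots.txt tokens 
are made up for illustration; only the index() line comes from the module code 
quoted above.

```perl
use strict;
use warnings;

# Hypothetical standalone stand-in for the current is_me() logic:
# true if $ua (the token from a robots.txt User-agent line) occurs
# anywhere in the full agent string, ignoring case.
sub is_me_current {
    my ($agent, $ua) = @_;
    return index(lc($agent), lc($ua)) >= 0;
}

my $agent = "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)";

# Illustrative tokens only; "Bar" is the one that should not match.
for my $ua ("Foo", "foo/1.2", "Blargle", "1.2", "Bar") {
    printf "%-10s %s\n", $ua,
        is_me_current($agent, $ua) ? "matches" : "no match";
}
```

With the full agent string as the haystack, tokens like "Blargle" or "1.2" 
match just as well as "Foo" does; only "Bar" comes back as no match.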

I think that what it /should/ do (given what the various specs say) is this:

sub is_me {
     my($self, $ua) = @_;
     my $me = $self->agent;
     $me = $1 if $me =~ m<(\S+)>; # first word
     $me =~ s</\d+(\.\d+)?$><> or $me =~ s</\.\d+$><>;
       # remove version string
     return index(lc($me), lc($ua)) >= 0;
}

where that regexp extracts the "Foo" in all of: "Foo", "Foo/1.2", and 
"Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)".
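
A quick way to check that claim is to pull the two regexp lines out into a 
helper and run them over those three strings. The helper name short_name is 
hypothetical; the two substitution lines are copied from the proposed is_me().

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two lines from the proposed is_me().
sub short_name {
    my ($me) = @_;
    $me = $1 if $me =~ m<(\S+)>;                       # first word
    $me =~ s</\d+(\.\d+)?$><> or $me =~ s</\.\d+$><>;  # remove version string
    return $me;
}

for my $agent (
    "Foo",
    "Foo/1.2",
    "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)",
) {
    print short_name($agent), "\n";   # prints "Foo" each time
}
```

One thing this sketch makes visible: the pattern only covers versions of the 
form "/1", "/1.2" or "/.2", so a three-part version such as "Foo/1.2.3" would 
pass through unchanged.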


E.g.,  http://www.robotstxt.org/wc/norobots.html says:
<<User-agent [...] The robot should be liberal in interpreting this field. 
A case insensitive substring match of the name without version information 
is recommended.>>

...note the "without version information".  Ditto the spec you cited, which 
says <<That is, the User-Agent (HTTP) header consists of one or more words, 
and the very first word is taken to be the "name", which is referred to in 
the robot exclusion files.>>
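
Put together, the reading of those two passages that I'm arguing for can be 
sketched as a standalone function. The name is_me_proposed and the sample 
tokens are made up for illustration, and this is only my interpretation of the 
specs, not anything the module does today.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the proposed is_me(): reduce the
# agent string to its bare name (first word, version removed), then do
# the case-insensitive substring test against the robots.txt token.
sub is_me_proposed {
    my ($agent, $ua) = @_;
    my $me = $agent;
    $me = $1 if $me =~ m<(\S+)>;                       # first word
    $me =~ s</\d+(\.\d+)?$><> or $me =~ s</\.\d+$><>;  # remove version string
    return index(lc($me), lc($ua)) >= 0;
}

my $agent = "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)";

for my $ua ("foo", "Fo", "Blargle", "1.2") {
    printf "%-8s %s\n", $ua,
        is_me_proposed($agent, $ua) ? "matches" : "no match";
}
```

Here "foo" and "Fo" still match, being substrings of the bare name, while 
"Blargle" and "1.2" no longer do.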


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/


