I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
implements a certain part of the Robots Exclusion Standard correctly.  So 
forgive me if this seems a simple question, but my reading of the Robots 
Exclusion Standard hasn't really cleared it up in my mind yet.


Basically the current WWW::RobotRules logic is this:
As a WWW::RobotRules object parses the lines in the robots.txt file, 
if it sees a line that says "User-Agent: ...foo...", it extracts the 
"...foo..." value, and if the name of the current user-agent is a substring 
of that value, then it considers the line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt line 
being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the 
library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', 
so this rule is talking to me!"

However, the substring matching currently goes only one way.  So if the 
user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html 
[EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: 
Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 
[http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 
'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"
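
To make the two cases above concrete, here is a rough sketch of the one-way 
check as I have described it. It is written in Python purely for 
illustration and is not the library's actual Perl code; the function name 
applies_to_me is made up, and the long ID string is abbreviated (the 
address part is left out).

```python
def applies_to_me(agent_name: str, ua_line_value: str) -> bool:
    """Model of the one-way check: is the agent's name a substring
    of the value on the robots.txt "User-Agent:" line?"""
    return agent_name in ua_line_value


# Value taken from the example "User-Agent:" line in this message.
line_value = "Thing, Woozle, Banjo, Stuff"

# Short name: found inside the line, so the rule is treated as applying.
short_result = applies_to_me("Banjo", line_value)

# Full ID string (abbreviated here): not a substring, so the rule is skipped.
full_result = applies_to_me("Banjo/1.1 [http://nowhere.int/banjo.html]", line_value)

print(short_result, full_result)
```

Under these assumptions the short name matches and the full ID string does 
not, which is exactly the asymmetry I am asking about.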

I have the feeling that that's not right -- notably because that means that 
every robot ID string has to appear in toto on the "User-Agent" robots.txt 
line, which is clearly a bad thing.
But before I submit a patch, I'm tempted to ask... what /is/ the proper 
behavior?

Maybe shave the current user-agent's name at the first slash or space 
(getting just "Banjo"), and then see if /that/ is a substring of a given 
robots.txt "User-Agent:" line?
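
A sketch of that idea, again in Python for illustration only and not a 
patch against the Perl module. The names short_name and 
applies_to_me_shaved are hypothetical, the ID string is abbreviated, and 
the guard against an empty name is my own assumption.

```python
import re


def short_name(agent_id: str) -> str:
    """Keep only the part of the ID string before the first slash or whitespace."""
    return re.split(r"[/\s]", agent_id, maxsplit=1)[0]


def applies_to_me_shaved(agent_id: str, ua_line_value: str) -> bool:
    """Shave the agent ID first, then do the same one-way substring check."""
    name = short_name(agent_id)
    if not name:
        # An empty name would be a substring of everything; treat it as no match.
        return False
    return name in ua_line_value


line_value = "Thing, Woozle, Banjo, Stuff"
shaved = short_name("Banjo/1.1 [http://nowhere.int/banjo.html]")
matched = applies_to_me_shaved("Banjo/1.1 [http://nowhere.int/banjo.html]", line_value)
print(shaved, matched)
```

With the shaving step, the long ID string reduces to "Banjo" and matches 
the example line. It is still a plain substring test, so a name like "Ban" 
would also match; whether that is acceptable is part of the same question.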

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/

