Hi
On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said:

SMB> I'm a bit perplexed over whether the current Perl library
SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
SMB> Standard correctly. So forgive me if this seems a simple
SMB> question, but my reading of the Robots Exclusion Standard hasn't
SMB> really cleared it up in my mind yet.
SMB>
SMB> Basically the current WWW::RobotRules logic is this: As a
SMB> WWW::RobotRules object is parsing the lines in the robots.txt
SMB> file, if it sees a line that says "User-Agent: ...foo...", it
SMB> extracts the foo, and if the name of the current user-agent is a
SMB> substring of "...foo...", then it considers this line as applying
SMB> to it.
SMB> [...]
SMB> However, the substring matching currently goes only one way. So
SMB> if the user-agent object is called "Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the
SMB> robots.txt line being parsed says "User-Agent: Thing, Woozle,
SMB> Banjo, Stuff", then the library says "'Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
SMB> substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT
SMB> talking to me!"
SMB> [...]
SMB> But before I submit a patch, I'm tempted to ask... what /is/ the
SMB> proper behavior? [...]

I'm sorry, but I think you're mistaken.

From the HTTP spec
(<http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#user-agent>):

    "User-Agent: This line if present gives the software program used
    by the original client. This is for statistical purposes and the
    tracing of protocol violations. It should be included. The first
    white space delimited word must be the software product name, with
    an optional slash and version designator. Other products which
    form part of the user agent may be put as separate words.
        <field>   = User-Agent: <product>+
        <product> = <word> [/<version>]
        <version> = <word>"

That is, the User-Agent (HTTP) header consists of one or more words,
and the very first word is taken to be the "name", which is what the
robot exclusion files refer to.

When you look at the WWW::RobotRules implementation, you will see that
the actual comparison is done in the is_me() method, and essentially
looks like this:

    index(lc($self->agent), lc($ua)) >= 0;

where $ua is the user agent "name" from the robot exclusion file. I.e.
it checks whether the user agent "name" given in robots.txt appears
within the whole UA identifier, which is exactly what's required.

Regards,
Martin

--
This message was sent by the Internet robots and spiders discussion
list ([EMAIL PROTECTED]). For list server commands, send "help" in the
body of a message to "[EMAIL PROTECTED]".
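[A follow-up note for readers tracing the logic: the one-line Perl
comparison above can be sketched in Python as below. This is an
illustration of the substring test only, not the actual WWW::RobotRules
source; the function name and sample strings are mine.]

```python
def is_me(full_agent: str, rule_name: str) -> bool:
    """Sketch of the check in WWW::RobotRules' is_me().

    Mirrors the Perl expression  index(lc($self->agent), lc($ua)) >= 0:
    the short robot "name" taken from a robots.txt User-Agent line
    (rule_name) must occur, case-insensitively, somewhere inside the
    robot's full User-Agent identifier (full_agent).
    """
    return full_agent.lower().find(rule_name.lower()) >= 0


# The first whitespace-delimited word of the UA header is the product
# name, so a robots.txt rule naming "Banjo" matches this identifier:
full = "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]"
print(is_me(full, "Banjo"))      # True
print(is_me(full, "banjo/1.1"))  # True  (comparison is case-insensitive)
print(is_me(full, "Woozle"))     # False (name not in this identifier)
```

Note the direction of the test: the robots.txt name is the needle and
the full identifier is the haystack, which is why a long identifier is
never required to be a substring of the rule line.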
