Hi

 On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
SMB> I'm a bit perplexed over whether the current Perl library
SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
SMB> Standard correctly.  So forgive me if this seems a simple
SMB> question, but my reading of the Robots Exclusion Standard hasn't
SMB> really cleared it up in my mind yet.
SMB> 
SMB> Basically the current WWW::RobotRules logic is this: As a
SMB> WWW::RobotRules object is parsing the lines in the robots.txt
SMB> file, if it sees a line that says "User-Agent: ...foo...", it
SMB> extracts the foo, and if the name of the current user-agent is a
SMB> substring of "...foo...", then it considers this line as applying
SMB> to it.
[...]
SMB> However, the substring matching currently goes only one way.  So
SMB> if the user-agent object is called "Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the
SMB> robots.txt line being parsed says "User-Agent: Thing, Woozle,
SMB> Banjo, Stuff", then the library says "'Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
SMB> substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT
SMB> talking to me!"
SMB> 
[...]
SMB> But before I submit a patch, I'm tempted to ask... what /is/ the
SMB> proper behavior?
[...]

I'm sorry, but I think you're mistaken:

From the HTTP spec:
(<http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#user-agent>)

"User-Agent:

  This line if present gives the software program used by the original
  client. This is for statistical purposes and the tracing of protocol
  violations. It should be included. The first white space delimited
  word must be the software product name, with an optional slash and
  version designator.

  Other products which form part of the user agent may be put as
  separate words.

        <field>   =   User-Agent: <product>+
        <product> =   <word> [/<version>]
        <version> =   <word>
"

That is, the User-Agent (HTTP) header consists of one or more words,
and the very first word is taken to be the "name", which is referred
to in the robot exclusion files.
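As a sketch of that rule (Python for illustration; the function name is mine, not from any library), extracting the "name" that robot exclusion files refer to means taking the first whitespace-delimited word and dropping the optional /version suffix:

```python
def product_name(user_agent):
    """Return the software product name from a User-Agent value,
    per the grammar quoted above: the first whitespace-delimited
    word, minus any optional "/version" designator."""
    first_word = user_agent.split()[0]
    return first_word.split("/", 1)[0]

# The name referred to in robots.txt is just "Banjo":
print(product_name("Banjo/1.1 [http://nowhere.int/banjo.html]"))  # Banjo
```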

When you look at the WWW::RobotRules implementation, you will see that
the actual comparison is done in the is_me() method, and essentially
looks like this:

 index(lc($self->agent), lc($ua)) >= 0;

where $ua is the user agent "name" in the robot exclusion file. That
is, it checks whether the user agent "name" from robots.txt is part
of the whole UA identifier, which is exactly what's required.
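A minimal Python sketch of that comparison (the real module is Perl; the function name here is mine), mirroring the case-insensitive index() check:

```python
def is_me(agent_full, ua_token):
    """Return True if the robots.txt User-Agent token applies to
    this client: the token must appear as a case-insensitive
    substring of the full User-Agent identifier, matching the
    Perl check  index(lc($self->agent), lc($ua)) >= 0."""
    return ua_token.lower() in agent_full.lower()

# The short name "Banjo" from robots.txt matches the long identifier:
print(is_me("Banjo/1.1 [http://nowhere.int/banjo.html]", "Banjo"))   # True
# A token for a different robot does not:
print(is_me("Banjo/1.1 [http://nowhere.int/banjo.html]", "Woozle"))  # False
```

Note the direction of the test: the short token is looked for inside the long identifier, not the other way around, which is why the one-way matching in the original question works once the token is the product name.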

Regards, Martin

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
