LWP?  It's very popular in the large Perl community.

--- Rasmus Mohr <[EMAIL PROTECTED]> wrote:
> 
> Any idea how widespread the use of this library is? We've observed some
> weird behaviors from some of the major search engines' spiders (basically
> ignoring robots.txt sections) - maybe this is the explanation?
> 
> --------------------------------------------------------------
> Rasmus T. Mohr            Direct  :             +45 36 910 122
> Application Developer     Mobile  :             +45 28 731 827
> Netpointers Intl. ApS     Phone   :             +45 70 117 117
> Vestergade 18 B           Fax     :             +45 70 115 115
> 1456 Copenhagen K         Email   : mailto:[EMAIL PROTECTED]
> Denmark                   Website : http://www.netpointers.com
> 
> "Remember that there are no bugs, only undocumented features."
> --------------------------------------------------------------
> 
> -----Original message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Sean M. Burke
> Sent: 14 March 2002 11:08
> To: [EMAIL PROTECTED]
> Subject: [Robots] matching and "UserAgent:" in robots.txt
> 
> 
> 
> I'm a bit perplexed over whether the current Perl library WWW::RobotRules
> implements a certain part of the Robots Exclusion Standard correctly.  So
> forgive me if this seems a simple question, but my reading of the Robots
> Exclusion Standard hasn't really cleared it up in my mind yet.
> 
> 
> Basically the current WWW::RobotRules logic is this:
> As a WWW::RobotRules object is parsing the lines in the robots.txt file,
> if it sees a line that says "User-Agent: ...foo...", it extracts the foo,
> and if the name of the current user-agent is a substring of "...foo...",
> then it considers this line as applying to it.
> 
> So if the agent being modeled is called "Banjo", and the robots.txt line
> being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the
> library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff',
> so this rule is talking to me!"
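
To make that concrete, here is a rough sketch of the matching as described
above. It is in Python rather than Perl, it is not the WWW::RobotRules
source, and the function name rule_applies is made up for illustration. The
message does not say how case is handled, so this sketch compares the agent
name exactly as given.

```python
def rule_applies(agent_name: str, robots_line: str) -> bool:
    """Return True if agent_name occurs inside the value of a User-Agent line.

    Hypothetical helper: mirrors the one-way substring test described in
    the quoted message, not the actual library code.
    """
    # Split "User-Agent: value" at the first colon only.
    field, _, value = robots_line.partition(":")
    if field.strip().lower() != "user-agent":
        return False  # not a User-Agent line at all
    # One-way test: is the agent's name a substring of the line's value?
    return agent_name in value


line = "User-Agent: Thing, Woozle, Banjo, Stuff"
banjo_matches = rule_applies("Banjo", line)
kazoo_matches = rule_applies("Kazoo", line)
```

With the line from the example, "Banjo" matches and an agent not listed
there (the invented "Kazoo") does not.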
> 
> However, the substring matching currently goes only one way.  So if the
> user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html
> [EMAIL PROTECTED]]" and the robots.txt line being parsed says
> "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says
> "'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
> substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking
> to me!"
> 
> I have the feeling that that's not right -- notably because that means that
> every robot ID string has to appear in toto on the "User-Agent" robots.txt
> line, which is clearly a bad thing.
> But before I submit a patch, I'm tempted to ask... what /is/ the proper
> behavior?
> 
> Maybe shave the current user-agent's name at the first slash or space
> (getting just "Banjo"), and then see if /that/ is a substring of a given
> robots.txt "User-Agent:" line?
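
A minimal sketch of that proposal, again in Python and only under my own
assumptions: short_name and rule_applies are hypothetical names, the agent
ID below is a shortened stand-in for the one in the message, and whether
this is the behavior the Robots Exclusion Standard intends is exactly the
open question.

```python
import re


def short_name(agent_id: str) -> str:
    """Cut the agent ID at the first slash or whitespace character."""
    return re.split(r"[/\s]", agent_id.strip(), maxsplit=1)[0]


def rule_applies(agent_id: str, user_agent_value: str) -> bool:
    """Match on the shaved name instead of the full ID string."""
    name = short_name(agent_id)
    if not name:
        return False  # an empty name would otherwise match every line
    return name in user_agent_value


value = "Thing, Woozle, Banjo, Stuff"
full_id = "Banjo/1.1 [http://nowhere.int/banjo.html]"

naive = full_id in value              # the current one-way test on the full ID
shaved = short_name(full_id)          # the name after shaving
fixed = rule_applies(full_id, value)  # the proposed test
```

Here the full ID string fails the plain substring test, the shaved name
comes out as "Banjo", and the proposed test then matches the line.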
> 
> --
> Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/
> 
> 
> --
> This message was sent by the Internet robots and spiders discussion list
> ([EMAIL PROTECTED]).  For list server commands, send "help" in the body of
> a message to "[EMAIL PROTECTED]".

