LWP? It is very popular in the large Perl community.
--- Rasmus Mohr <[EMAIL PROTECTED]> wrote:
> Any idea how widespread the use of this library is? We've observed some
> weird behaviors from some of the major search engines' spiders
> (basically ignoring robots.txt sections) - maybe this is the
> explanation?
>
> --------------------------------------------------------------
> Rasmus T. Mohr            Direct  : +45 36 910 122
> Application Developer     Mobile  : +45 28 731 827
> Netpointers Intl. ApS     Phone   : +45 70 117 117
> Vestergade 18 B           Fax     : +45 70 115 115
> 1456 Copenhagen K         Email   : mailto:[EMAIL PROTECTED]
> Denmark                   Website : http://www.netpointers.com
>
> "Remember that there are no bugs, only undocumented features."
> --------------------------------------------------------------
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of
> Sean M. Burke
> Sent: 14 March 2002 11:08
> To: [EMAIL PROTECTED]
> Subject: [Robots] matching and "UserAgent:" in robots.txt
>
> I'm a bit perplexed over whether the current Perl library
> WWW::RobotRules implements a certain part of the Robots Exclusion
> Standard correctly. So forgive me if this seems a simple question, but
> my reading of the Robots Exclusion Standard hasn't really cleared it up
> in my mind yet.
>
> Basically the current WWW::RobotRules logic is this:
> As a WWW::RobotRules object is parsing the lines in the robots.txt
> file, if it sees a line that says "User-Agent: ...foo...", it extracts
> the foo, and if the name of the current user-agent is a substring of
> "...foo...", then it considers this line as applying to it.
>
> So if the agent being modeled is called "Banjo", and the robots.txt
> line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then
> the library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo,
> Stuff', so this rule is talking to me!"
>
> However, the substring matching currently goes only one way. So if the
> user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html
> [EMAIL PROTECTED]]" and the robots.txt line being parsed says
> "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says
> "'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
> substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking
> to me!"
>
> I have the feeling that that's not right -- notably because that means
> that every robot ID string has to appear in toto on the "User-Agent"
> robots.txt line, which is clearly a bad thing.
> But before I submit a patch, I'm tempted to ask... what /is/ the proper
> behavior?
>
> Maybe shave the current user-agent's name at the first slash or space
> (getting just "Banjo"), and then see if /that/ is a substring of a
> given robots.txt "User-Agent:" line?
>
> --
> Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/
>
> --
> This message was sent by the Internet robots and spiders discussion
> list ([EMAIL PROTECTED]). For list server commands, send "help" in the
> body of a message to "[EMAIL PROTECTED]".
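
For anyone following the thread, here is a rough sketch of the two behaviors Sean describes in the quoted message: the one-way substring check, and his suggested fix of cutting the agent name at the first slash or space. It is written in Python only to illustrate the logic and is not the WWW::RobotRules source. The function names (`current_match`, `short_name`, `proposed_match`) are made up for this example, the case-insensitive comparison is an assumption the post does not state, and the sample agent string leaves out the redacted e-mail address.

```python
import re


def current_match(agent_name: str, ua_value: str) -> bool:
    """One-way check as described in the post: the robot's full name
    must appear as a substring of the robots.txt User-Agent value."""
    # Assumption: compare case-insensitively (not stated in the post).
    return agent_name.lower() in ua_value.lower()


def short_name(agent_name: str) -> str:
    """Suggested fix: keep only the part before the first slash or space."""
    return re.split(r"[/\s]", agent_name.strip(), maxsplit=1)[0]


def proposed_match(agent_name: str, ua_value: str) -> bool:
    """Substring check using the shortened name instead of the full one."""
    name = short_name(agent_name)
    if not name:
        # An empty string is a substring of everything, so refuse to match.
        return False
    return name.lower() in ua_value.lower()


# Values taken from the example in the quoted message.
UA_LINE_VALUE = "Thing, Woozle, Banjo, Stuff"
FULL_AGENT = "Banjo/1.1 [http://nowhere.int/banjo.html]"
```

With these definitions, `current_match("Banjo", UA_LINE_VALUE)` succeeds while `current_match(FULL_AGENT, UA_LINE_VALUE)` fails, which is the problem raised in the post; `proposed_match(FULL_AGENT, UA_LINE_VALUE)` succeeds because only "Banjo" is compared. Whether plain substring matching on the short name is the proper behavior is exactly the open question in this thread.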
