RE: [Robots] robot in python?
At 11:47 PM 2003-11-17, SsolSsinclair wrote:

> Open Source is a project which came into being through a collective effort. Intelligence matching Intelligence. This movement cannot be stopped or prevented, SHORT of ceasing communication of all [resulting in Deaf Silence, and the Elimination of Sound as a sensory perception, clearly not in the interest of any individual or body or civilization, if it were possible in the first place.

You talk funny! This pleases me.

--
Sean M. Burke
http://search.cpan.org/~sburke/

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
[Robots] Re: matching and UserAgent: in robots.txt
I dug around more in Perl LWP's WWW::RobotRules module, and the short story is that the bug I found exists, but it's not as bad as I thought.

If you set up a user agent with the name Foobar/1.23, a WWW::RobotRules object actually /does/ currently know to strip off the /1.23 (this happens in the 'agent' method, not in the is_me method where I expected it). The current bug surfaces only when your user-agent name is more than one word: if your user-agent name is Foobar/1.23 [[EMAIL PROTECTED]], the current 'agent' method's logic says, well, it doesn't end in '/number.number', so there's no version to strip off.

So I'm going to send Gisle Aas a patch so that the first word, minus any version suffix, is what's used for matching. It's just a matter of adding a line in the 'agent' method saying:

    $name = $1 if $name =~ m/(\S+)/;  # get first word

--
Sean M. Burke
[EMAIL PROTECTED]
http://www.spinn.net/~sburke/

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to [EMAIL PROTECTED].
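[Editor's sketch, in Python rather than the module's Perl, of the normalization the patch describes: take the first whitespace-delimited word, then strip any trailing /version suffix. The function name normalize_agent_name is hypothetical; the actual fix is the one-line Perl patch quoted in the message.]

```python
import re

def normalize_agent_name(name):
    """Hypothetical illustration of the patched 'agent'-method logic:
    first word only, then any '/1.23'-style version suffix removed."""
    m = re.match(r"\S+", name)       # first word (the proposed new step)
    if m:
        name = m.group(0)
    name = re.sub(r"/[\d.]+$", "", name)  # strip version suffix (existing step)
    return name

print(normalize_agent_name("Foobar/1.23"))            # Foobar
print(normalize_agent_name("Foobar/1.23 [contact]"))  # Foobar (multi-word case the bug hit)
```

With only the version-stripping step, the multi-word name "Foobar/1.23 [contact]" would fail to match, exactly as described above.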
[Robots] matching and UserAgent: in robots.txt
I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Basically, the current WWW::RobotRules logic is this: as a WWW::RobotRules object is parsing the lines in the robots.txt file, if it sees a line that says "User-Agent: ...foo...", it extracts the foo, and if the name of the current user-agent is a substring of ...foo..., then it considers that line as applying to it. So if the agent being modeled is called Banjo, and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: OK, 'Banjo' is a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is talking to me!

However, the substring matching currently goes only one way. So if the user-agent object is called Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]] and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: 'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!

I have the feeling that that's not right -- notably because it means that every robot ID string has to appear in toto on the User-Agent robots.txt line, which is clearly a bad thing. But before I submit a patch, I'm tempted to ask: what /is/ the proper behavior? Maybe shave the current user-agent's name at the first slash or space (getting just Banjo), and then see if /that/ is a substring of a given robots.txt User-Agent: line?

--
Sean M. Burke
[EMAIL PROTECTED]
http://www.spinn.net/~sburke/
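[Editor's sketch, in Python rather than the module's Perl, of the one-way substring check described above and of the proposed shave-at-first-slash-or-space fix. Both function names are hypothetical illustrations, not WWW::RobotRules API.]

```python
import re

def rule_applies_current(agent_full, ua_line):
    """Current (one-way) logic: the FULL agent string must be a
    substring of the robots.txt User-Agent line."""
    return agent_full.lower() in ua_line.lower()

def rule_applies_proposed(agent_full, ua_line):
    """Proposed logic: shave the agent name at the first slash or
    space, then check that short name against the line."""
    short = re.split(r"[/\s]", agent_full, 1)[0]
    return short.lower() in ua_line.lower()

ua_line = "Thing, Woozle, Banjo, Stuff"
print(rule_applies_current("Banjo", ua_line))                 # True
print(rule_applies_current("Banjo/1.1 [banjo-info]", ua_line))   # False -- the bug
print(rule_applies_proposed("Banjo/1.1 [banjo-info]", ua_line))  # True
```

The second call shows the failure the message describes: once the agent string carries a version and contact info, the full string can never be a substring of a multi-name User-Agent line.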
[Robots] Re: matching and UserAgent: in robots.txt
At 12:49 2002-03-14 -0800, Nick Arnett wrote:

> [...] That does seem to be a problem, since apparently version numbers were contemplated in User-Agent headers... Sounds like something for the LWP author(s).

Yes, we are (hereby) thinking about it. I thought I'd seek the wisdom of the list on this before bringing it up with the others, tho.

--
Sean M. Burke
[EMAIL PROTECTED]
http://www.spinn.net/~sburke/