RE: [Robots] robot in python?

2003-11-26 Thread Sean M. Burke
At 11:47 PM 2003-11-17, SsolSsinclair wrote:
> Open Source is a project which came into being through a collective 
> effort. Intelligence matching Intelligence. This movement cannot be 
> stopped or prevented, SHORT of ceasing communication of all [resulting in 
> Deaf Silence, and the Elimination of Sound as a sensory perception, 
> clearly not in the interest of any individual or body or civilization, if 
> it were possible in the first place.

You talk funny!

This pleases me.

--
Sean M. Burke    http://search.cpan.org/~sburke/
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Re: matching and UserAgent: in robots.txt

2002-03-15 Thread Sean M. Burke


I dug around more in Perl LWP's WWW::RobotRules module and the short story 
is that the bug I found exists, but that it's not as bad as I thought.
If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules 
object actually /does/ currently know to strip off the "/1.23" (this 
happens in the 'agent' method, not in the is_me method where I expected it).

The current bug surfaces only when your user-agent name is more than one 
word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the 
current 'agent' method's logic says "well, it doesn't end in 
'/number.number', so there's no version to strip off."

So I'm going to send Gisle Aas a patch so that the first word, minus any 
version suffix, is what's used for matching.  It's just a matter of adding 
a line saying:
 $name = $1 if $name =~ m/(\S+)/; # get first word
in the 'agent' method.
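
For anyone following along without the Perl source handy, here is a rough Python sketch of the before-and-after behavior described above. It is not the WWW::RobotRules code: the function names short_name_before and short_name_after are hypothetical, the exact pattern for spotting a trailing version is an assumption based on the "/number.number" description, and the contact address is a made-up placeholder.

```python
import re

def short_name_before(name: str) -> str:
    """Strip a version only when the whole string ends in '/number.number'.

    Assumed pattern; the real 'agent' method may differ in detail.
    """
    return re.sub(r"/\d+\.\d+$", "", name)

def short_name_after(name: str) -> str:
    """Take the first word, then strip any version suffix from it."""
    match = re.search(r"\S+", name)  # same idea as m/(\S+)/ in the patch
    first_word = match.group(0) if match else name
    return short_name_before(first_word)

one_word = "Foobar/1.23"
multi_word = "Foobar/1.23 [bot@example.invalid]"  # placeholder address

print(short_name_before(one_word))    # version is stripped
print(short_name_before(multi_word))  # unchanged: string does not end in a version
print(short_name_after(multi_word))   # first word, minus its version
```

The only difference between the two functions is the first-word step, which matches the one-line nature of the patch.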


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] matching and UserAgent: in robots.txt

2002-03-14 Thread Sean M. Burke


I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
implements a certain part of the Robots Exclusion Standard correctly.  So 
forgive me if this seems a simple question, but my reading of the Robots 
Exclusion Standard hasn't really cleared it up in my mind yet.


Basically the current WWW::RobotRules logic is this:
As a WWW::RobotRules object is parsing the lines in the robots.txt file, 
if it sees a line that says "User-Agent: ...foo...", it extracts the foo, 
and if the name of the current user-agent is a substring of "...foo...", 
then it considers this line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt line 
being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the 
library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', 
so this rule is talking to me!"

However, the substring matching currently goes only one way.  So if the 
user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html 
[EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: 
Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 
[http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 
'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"
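
To make the one-way check concrete, here is a small Python sketch of the logic as described above. It is an illustration only, not the library's Perl code: the function name rule_applies is hypothetical, the case folding is my assumption (the Robots Exclusion Standard recommends a case-insensitive match), and the contact address is a made-up placeholder.

```python
def rule_applies(agent_name: str, user_agent_value: str) -> bool:
    """Return True when the agent's full name occurs inside the
    value of a robots.txt User-Agent line (one-way substring test)."""
    # Case folding is an assumption, not something stated in the message.
    return agent_name.lower() in user_agent_value.lower()

line_value = "Thing, Woozle, Banjo, Stuff"
long_name = "Banjo/1.1 [http://nowhere.int/banjo.html bot@example.invalid]"

print(rule_applies("Banjo", line_value))    # True: short name is a substring
print(rule_applies(long_name, line_value))  # False: full ID string is not
```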

I have the feeling that that's not right -- notably because that means that 
every robot ID string has to appear in toto on the User-Agent robots.txt 
line, which is clearly a bad thing.
But before I submit a patch, I'm tempted to ask... what /is/ the proper 
behavior?

Maybe shave the current user-agent's name at the first slash or space 
(getting just "Banjo"), and then see if /that/ is a substring of a given 
robots.txt "User-Agent:" line?
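
A minimal Python sketch of that idea, for discussion purposes. The names shaved_name and rule_applies_shaved are hypothetical, the case folding and the empty-name guard are my assumptions, and the contact address is a placeholder; this is not what the library currently does.

```python
import re

def shaved_name(agent_name: str) -> str:
    """Keep only what precedes the first slash or whitespace character."""
    return re.split(r"[/\s]", agent_name.strip(), maxsplit=1)[0]

def rule_applies_shaved(agent_name: str, user_agent_value: str) -> bool:
    """Substring-test the shaved name against a User-Agent line's value."""
    short = shaved_name(agent_name)
    if not short:
        # Guard (assumption): an empty name should never match anything.
        return False
    return short.lower() in user_agent_value.lower()

long_name = "Banjo/1.1 [http://nowhere.int/banjo.html bot@example.invalid]"
print(shaved_name(long_name))
print(rule_applies_shaved(long_name, "Thing, Woozle, Banjo, Stuff"))
```

With the name shaved first, the long ID string and the bare name behave the same way against the example line.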

--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/





[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:49 2002-03-14 -0800, Nick Arnett wrote:
> [...]That does seem to be a problem, since apparently
> version numbers were contemplated in User-Agent headers...  Sounds like
> something for the LWP author(s).

Yes, we are (hereby) thinking about it.
I thought I'd seek the wisdom of the list on this before bringing it up 
with the others, tho.


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/

