RE: [Robots] robot in python?
At 11:47 PM 2003-11-17, SsolSsinclair wrote:
>Open Source is a project which came into being through a collective effort.
>Intelligence matching Intelligence. This movement cannot be stopped or
>prevented, SHORT of ceasing communication of all [resulting in Deaf Silence,
>and the Elimination of Sound as a sensory perception, clearly not in the
>interest of any individual or body or civilization, if it were possible in
>the first place.

You talk funny! This pleases me.

-- 
Sean M. Burke  http://search.cpan.org/~sburke/

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
[Robots] leading whitespace in robots.txt files
Recently I saw LWP's WWW::RobotRules seeing a robots.txt file that looked like this:

    #
    User-agent: *
     Disallow: /cgi-bin/
     Disallow: /~mojojojo/misc/

It complained about the Disallow lines being "unexpected". The regexp it was using for these things is:

    /^Disallow:\s*(.*)/i

So I've changed it to this, and was about to submit it as a patch for the next LWP release:

    /^\s*Disallow:\s*(.*)/i  # Silently forgive leading whitespace.

But first, I thought I'd ask the list here: does anyone think this'd break anything? I sure hope no-one out there is using leading-whitespace lines as comments, or as RFC-822-style continuation lines!

Thoughts, anyone?

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
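For what it's worth, the difference is easy to demonstrate. Here's a sketch in Python (not the WWW::RobotRules code itself) comparing the strict pattern with the whitespace-forgiving one:

```python
import re

# Sketch: the stricter pattern vs. the proposed leading-whitespace-tolerant
# one. (Python equivalents of the Perl regexps, for illustration only.)
strict = re.compile(r'^Disallow:\s*(.*)', re.IGNORECASE)
lenient = re.compile(r'^\s*Disallow:\s*(.*)', re.IGNORECASE)

line = '   Disallow: /cgi-bin/'
print(strict.match(line))             # None: leading whitespace defeats it
print(lenient.match(line).group(1))   # /cgi-bin/
```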
[Robots] Re: better language for writing a Spider ?
At 10:36 2002-03-14 -0800, Nick Arnett wrote:
>[with Python] I'm not seeing long, mysterious time-outs as I occasionally
>did with LWP.

I have never run into this problem, but I have a dim memory that you may be alluding to what is a known bug not with LWP, but with old versions (long since fixed in modern Perls and/or CPAN) of the socket libraries in IO::*.

>Following an LWP request through the debugger is a long and convoluted
>journey...

Are you referring to perl -d, or LWP::Debug? Maybe I should write an addendum to "lwpcook.pod" on figuring out what's going wrong, when something does go wrong.

The current lwpcook really needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll be in press within a few weeks), I hope to send up some big doc patches to LWP, at the very least revamping lwpcook and then going into each class and noting in the docs whether a typical user needs to bother knowing about it. (E.g., you need to know about HTTP::Response; you do /not/ need to know about LWP::Protocol.)

In short, if people want to see improvements to LWP, email me and say what you want done, and I'll either try my hand at implementing it, or I'll pass it on to someone more capable. LWP is not the product of a massive bureaucracy, but of few enough people that you could fit all of us in a phone booth. We're all manically busy, to varying degrees (companies to run, children to raise, books/articles/modules to write, etc.), but we do at times manage to do what needs doing, if it's pointed out clearly enough to stand out from the torrent of email messages (which I find incessantly discouraging) that manage no better than "halo I try to use LWP with hotmel but not work plz hlp k thx".

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/

-- 
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
[Robots] Re: matching and "UserAgent:" in robots.txt
I dug around more in Perl LWP's WWW::RobotRules module, and the short story is that the bug I found exists, but that it's not as bad as I thought.

If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules object actually /does/ currently know to strip off the "/1.23" (this happens in the 'agent' method, not in the is_me method where I expected it). The current bug surfaces only when your user-agent name is more than one word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the current 'agent' method's logic says "well, it doesn't end in '/number.number', so there's no version to strip off".

So I'm going to send Gisle Aas a patch so that the first word, minus any version suffix, is what's used for matching. It's just a matter of adding a line saying:

    $name = $1 if $name =~ m/(\S+)/;  # get first word

in the 'agent' method.

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
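In Python terms, the effect of that matching rule (a sketch of the intended behavior, not the actual LWP patch) is:

```python
import re

def short_name(agent):
    # Take the first whitespace-separated word of the User-Agent string,
    # then drop any "/version" suffix -- so matching is done on the bare
    # robot name. (Illustrative sketch, not the WWW::RobotRules code.)
    first_word = agent.split(None, 1)[0]
    return re.sub(r'/.*$', '', first_word)

print(short_name('Foobar/1.23'))                      # Foobar
print(short_name('Foobar/1.23 [[EMAIL PROTECTED]]'))  # Foobar
```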
[Robots] Re: matching and "UserAgent:" in robots.txt
At 12:49 2002-03-14 -0800, Nick Arnett wrote:
>[...] That does seem to be a problem, since apparently
>version numbers were contemplated in User-Agent headers... Sounds like
>something for the LWP author(s).

Yes, we are (hereby) thinking about it. I thought I'd seek the wisdom of the list on this before bringing it up with the others, tho.

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
[Robots] Re: matching and "UserAgent:" in robots.txt
At 12:47 2002-03-14 +0100, Martin Beet wrote:
> On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
>SMB> I'm a bit perplexed over whether the current Perl library
>SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
>SMB> Standard correctly. So forgive me if this seems a simple
>SMB> question, but my reading of the Robots Exclusion Standard hasn't
>SMB> really cleared it up in my mind yet.
>[...]
>When you look at the WWW::RobotRules implementation, you will see that
>the actual comparison is done in the is_me() method, and essentially
>looks like this: [...] where $ua is the user agent "name" in the robot
>exclusion file. I.e. it checks to see whether the user agent "name" is
>part of the whole UA identifier. Which is exactly what's required.

Well, the code in full looks like this:

    # is_me()
    #
    # Returns TRUE if the given name matches the
    # name of this robot
    #
    sub is_me {
        my($self, $ua) = @_;
        my $me = $self->agent;
        return index(lc($me), lc($ua)) >= 0;
    }

But notice that it's asking whether the /whole/ agent name (like "Foo", "Foo/1.2", "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)") is a substring of the content in "User-Agent: ...content..." (the content is what's passed to $thing->is_me($content)).

I think that what it /should/ do (given what the various specs say) is this:

    sub is_me {
        my($self, $ua) = @_;
        my $me = $self->agent;
        $me = $1 if $me =~ m<(\S+)>;  # first word
        $me =~ s</.*$><>;             # remove version string
        return index(lc($me), lc($ua)) >= 0;
    }

where those regexps extract the "Foo" in all of: "Foo", "Foo/1.2", and "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)".

E.g., http://www.robotstxt.org/wc/norobots.html says: <> ...note the "without version information". Ditto the spec you cited, which says <>

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
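A sketch in Python of the matching rule those specs describe (a case-insensitive substring match on the name alone, with version information stripped) -- this is hypothetical illustration code, not the actual WWW::RobotRules patch:

```python
def is_me(agent, ua_line):
    # Reduce our full agent string to its bare name: first word,
    # minus any "/version" suffix...
    name = agent.split(None, 1)[0].split('/', 1)[0]
    # ...then do a case-insensitive substring match against the
    # content of the robots.txt "User-Agent:" line.
    return name.lower() in ua_line.lower()

agent = 'Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)'
print(is_me(agent, 'Foo'))                 # True
print(is_me(agent, 'Thing, Foo, Woozle'))  # True
print(is_me(agent, 'Woozle'))              # False
```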
[Robots] Re: matching and "UserAgent:" in robots.txt
Oops, I just noticed that my topic has "UserAgent:" where I meant "User-Agent:".

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
[Robots] matching and "UserAgent:" in robots.txt
I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Basically the current WWW::RobotRules logic is this: As a WWW::RobotRules object is parsing the lines in the robots.txt file, if it sees a line that says "User-Agent: ...foo...", it extracts the foo, and if the name of the current user-agent is a substring of "...foo...", then it considers this line as applying to it. So if the agent being modeled is called "Banjo", and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', so this rule is talking to me!"

However, the substring matching currently goes only one way. So if the user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"

I have the feeling that that's not right -- notably because that means that every robot ID string has to appear in toto on the "User-Agent" robots.txt line, which is clearly a bad thing. But before I submit a patch, I'm tempted to ask... what /is/ the proper behavior? Maybe shave the current user-agent's name at the first slash or space (getting just "Banjo"), and then see if /that/ is a substring of a given robots.txt "User-Agent:" line?

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
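The one-way failure is easy to see in a sketch (Python here, with the made-up "Banjo" agent; this illustrates the logic described above, not the module's actual code):

```python
import re

def applies_to_me(agent, ua_line):
    # Current one-way check (sketched): is the *entire* agent string
    # a substring of the robots.txt "User-Agent:" line?
    return agent.lower() in ua_line.lower()

line = 'Thing, Woozle, Banjo, Stuff'
print(applies_to_me('Banjo', line))  # True: the bare name matches
print(applies_to_me('Banjo/1.1 [http://nowhere.int/banjo.html]', line))  # False

# The suggested repair: shave the agent name at the first slash or
# space before doing the substring test.
def shaved(agent):
    return re.split(r'[/\s]', agent, maxsplit=1)[0]

print(applies_to_me(shaved('Banjo/1.1 [http://nowhere.int/banjo.html]'), line))  # True
```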
[Robots] Re: Perl and LWP robots
The replies to my request for advice have been very helpful! I'll pick one and reply to it:

At 10:01 2002-03-07 -0800, Otis Gospodnetic wrote:
>[about my forthcoming book]
>(i.e. I'm a potential customer :)) When will it be published?

It's probably going into tech edit later this month. So it'll probably be out this summer. (Altho bear in mind that I live in New Mexico, where summer is just about everything between February and December.)

>I think lots of people do want to know about recursive spiders, and I
>bet one of the most frequent obstacles are issues like: queueing, depth
>vs. breadth first crawling, (memory) efficient storage of extracted and
>crawled links, etc.

I'm getting the feeling that I should see spiders as of two kinds: kinds that spider everything under a given URL (like "http://www.speech.cs.cmu.edu/~sburke/pub/" or "http://www."), and kinds that go hog wide across all of the Web. The usefulness of the single-host spiders is pretty obvious to me. But why do people want to write spiders that potentially span all/any hosts? (Aside from people who are working for Google or similar.)

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
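Since queueing and depth- vs. breadth-first crawling came up: the choice mostly comes down to which end of the work queue the spider takes the next URL from. A little Python sketch, with a made-up in-memory link graph standing in for real fetching so it runs offline:

```python
from collections import deque

# Sketch: breadth-first vs. depth-first crawling differ only in which
# end of the work queue the next URL is popped from.
def crawl(start_url, depth_first=False, max_pages=100):
    queue = deque([start_url])
    seen = {start_url}          # dedupe so each URL is visited once
    order = []
    while queue and len(order) < max_pages:
        url = queue.pop() if depth_first else queue.popleft()
        order.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny fake link graph (hypothetical; a real spider would fetch pages
# and extract their links here).
GRAPH = {'a': ['b', 'c'], 'b': ['d'], 'c': ['e'], 'd': [], 'e': []}
def extract_links(url):
    return GRAPH.get(url, [])

print(crawl('a'))                    # ['a', 'b', 'c', 'd', 'e']  (breadth-first)
print(crawl('a', depth_first=True))  # ['a', 'c', 'e', 'b', 'd']  (depth-first)
```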
[Robots] Perl and LWP robots
Hi all! My name is Sean Burke, and I'm writing a book for O'Reilly, which is basically to replace Clinton Wong's now out-of-print /Web Client Programming with Perl/.

In my book draft so far, I haven't discussed actual recursive spiders (I've only discussed getting a given page, and then every page that it links to which is also on the same host), since I think that most readers who think they want a recursive spider really don't. But it has been suggested that I cover recursive spiders, just for the sake of completeness.

Aside from basic concepts (don't hammer the server; always obey the robots.txt; don't span hosts unless you are really sure that you want to), are there any particular bits of wisdom that list members would want me to pass on to my readers?

-- 
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/
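For readers who want the "basic concepts" in runnable form, here's a minimal Python sketch using the standard library's robots.txt parser (the agent name and rules here are made up for illustration):

```python
import urllib.robotparser

# Parse a small, made-up robots.txt; RobotFileParser.parse() takes the
# file's content as a list of lines.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /cgi-bin/',
])

# Always check before fetching -- and pause between requests (e.g.
# time.sleep(1) in the fetch loop) so you don't hammer the server.
print(rp.can_fetch('ExampleBot', 'http://example.com/index.html'))  # True
print(rp.can_fetch('ExampleBot', 'http://example.com/cgi-bin/foo')) # False
```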