At 11:31 AM 07/03/02 -0800, Nick Arnett wrote:
>>     * Write it in Perl (or equivalent).
>
>I suppose it doesn't help with a book on Perl, but I'm re-writing my robots
>in Python and I'm very happy with the way it's going.  

I consider Python to fall under "or equivalent" :)

>>     * Consider just staying away from xxx.lanl.gov and
>www.imc.com; it's not worth the grief.
>
>LANL was one of the very early beta sites, perhaps the first, for Verity's
>spider.  Perhaps that early experience led them to see such beasts as more
>of a problem than a solution.  I recall that we discovered some infinite
>recursion there.

The problem isn't the site; it's that both those sites employed
a paranoid flaming asshole of the first order, who would resort to 
threats and malicious mailbombing with intent to destroy whenever
he decided that a robot was misbehaving.  I ended up having to get
my lawyer to send a letter to the U.S. Dept. of Energy to stop the
mailbombing, which cost me some valuable personal data.  Mind you
this was in 1996... but the memory burns bright.

>    * If you're doing a really big robot, include a real human's email
>address in the HTTP request headers, and be responsive.
>
>When you say "big," Tim, do you mean in terms of breadth, or what?

"big" means "hits lots of websites whose operators you don't know"

>    * Consider doing your DNS as close as possible to the robot machines.
>
>Oh, yes.  Absolutely.  If you're being friendly to servers, you're rotating
>around hosts 

I disagree, see below.  But we agree on DNS.

>    * Consider not using LWP to fetch pages.
>
>And use ?? instead?

I ended up using a stub C program, because LWP's timeout mechanism
is really very shaky.  What you really need for a robot is a call that 
says "get a maximum of X bytes from this URL, take a maximum of Y 
seconds to do it, and come back when you're finished and tell me what 
happened."  Up until early 2000 (the last time I worked on this), 
LWP couldn't be made to do that.
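For what it's worth, that "X bytes, Y seconds" call is easy enough to sketch in Python these days.  This is my illustration, not Tim's C stub; the names and default limits are made up:

```python
import socket
import time
import urllib.error
import urllib.request

def bounded_fetch(url, max_bytes=250_000, max_seconds=30.0):
    """Fetch at most max_bytes from url, giving up after max_seconds.

    Returns (status, body); status is None if anything went wrong.
    """
    deadline = time.monotonic() + max_seconds
    try:
        # The timeout bounds each individual socket operation; the
        # deadline check below bounds the whole transfer.
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            chunks, got = [], 0
            while got < max_bytes and time.monotonic() < deadline:
                chunk = resp.read(min(8192, max_bytes - got))
                if not chunk:
                    break
                chunks.append(chunk)
                got += len(chunk)
            return resp.status, b"".join(chunks)
    except (OSError, urllib.error.URLError, socket.timeout):
        return None, b""
```

The key point is that both limits are enforced by the caller, not trusted to the library.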

>    * Consider crawling all the pages from a server, rather than going to a
>random server for each new request.
>
>You mean don't rotate around?

Yeah, it's counter-intuitive but it works really well.  Let's assume
your crawlees don't mind you crawling them (if they do, consider not
writing the robot).  In general I find that if you're courteous, i.e.
hit them with a single thread and wait a few seconds between each 
fetch, site operators don't mind being crawled in one big gulp and
getting it over with.  And if the site is large-scale at all, how
much extra overhead can one robot thread really represent?  And if you 
do this, you get a whole bunch of optimizations and simplifications:
you just keep spawning off threads, each of which sucks from a single 
site for a fixed period of time or until it's got it all.  And the database
structure you need to back your robot can be made a lot simpler;
think about it.  The increase in throughput is really remarkable.

The only cost is that you need to go from high parallelism to
REALLY VERY HIGH parallelism, but that's OK, just throw some
more RAM at the problem.

>    * Be really careful to avoid ever hitting any server with a lot of
>requests faster than it can deliver them.
>
>Interesting.  How do you do this?  Track response times and slow down your
>requests to match or exceed an average?  Did you ever do this for
>multi-threaded robots?

Yep; it's *really* easy if you're pointing a single thread at a single
site as I suggest: you just wait a varying amount of time, say 3 to 30
seconds, between page fetches, and back off if the server's responses
slow down.
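One simple way to get that back-off behavior is to scale the wait with the observed response time, clamped to the 3-to-30-second range Tim mentions.  The multiplier here is an illustrative knob, not a value from his robot:

```python
def next_delay(response_seconds, base=3.0, ceiling=30.0, multiplier=10.0):
    """Politeness delay scaled by the last observed response time,
    clamped to [base, ceiling].  A slow response pushes the wait
    toward the ceiling; a fast one keeps it near the floor."""
    return min(ceiling, max(base, response_seconds * multiplier))
```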

>    * Be sure that you never order URLs for processing by lexical order.
>
>Because?

Because you end up starting with all the ftp:// URLs, then all
the numeric hosts like http://123.123.123.132//, then
http://aaa.aabacus.com/, and so on.  Whatever criterion you use to
rank importance, that ordering will pretty clearly not do anything
sensible.
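A quick demonstration of the clumping (the URLs here are stand-ins):

```python
import random

urls = [
    "http://www.example.com/",
    "ftp://archive.example.org/pub/",
    "http://123.123.123.132/",
    "http://aaa.aabacus.com/",
]

# Lexical order clumps the queue by scheme and host spelling:
# ftp:// sorts before http://, and digit hosts before alphabetic ones.
lexical = sorted(urls)

# A shuffle (or ordering by a hash of the URL) spreads the work instead:
shuffled = urls[:]
random.shuffle(shuffled)
```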

>    * Read and use MIME headers, but verify them.
>
>Is there a Perl module for this?

Anyone here know?  -Tim
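[Not an answer to the Perl question, but the verification half is easy to sketch.  In Python, a magic-byte cross-check of the Content-Type header might look like this; the signature table is illustrative and far from complete:]

```python
def verify_content_type(claimed, body):
    """Cross-check a server's Content-Type header against magic bytes.

    Returns (header_is_plausible, best_guess_type).  A real robot
    would want a much longer signature table than this.
    """
    signatures = {
        b"%PDF-": "application/pdf",
        b"\x1f\x8b": "application/gzip",
        b"GIF87a": "image/gif",
        b"GIF89a": "image/gif",
        b"\x89PNG\r\n": "image/png",
    }
    for magic, actual in signatures.items():
        if body.startswith(magic):
            return claimed.split(";")[0].strip().lower() == actual, actual
    return True, claimed  # no signature matched; take the header at its word
```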


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
