At 11:31 AM 07/03/02 -0800, Nick Arnett wrote:

>> * Write it in Perl (or equivalent).
>
> I suppose it doesn't help with a book on Perl, but I'm re-writing my
> robots in Python and I'm very happy with the way it's going.
I consider Python to fall under "or equivalent" :)

>> * Consider just staying away from xxx.lanl.gov and www.imc.com; it's
>> not worth the grief.
>
> LANL was one of the very early beta sites, perhaps the first, for
> Verity's spider. Perhaps that early experience led them to see such
> beasts as more of a problem than a solution. I recall that we discovered
> some infinite recursion there.

The problem isn't the site; it's that both those sites employed a paranoid
flaming asshole of the first order, who would resort to threats and
malicious mailbombing with intent to destroy whenever he decided that a
robot was misbehaving. I ended up having to get my lawyer to send a letter
to the U.S. Dept. of Energy to stop the mailbombing, which cost me some
valuable personal data. Mind you, this was in 1996... but the memory burns
bright.

>> * If you're doing a really big robot, include a real human's email
>> address in the HTTP request headers, and be responsive.
>
> When you say "big," Tim, do you mean in terms of breadth, or what?

"Big" means "hits lots of websites whose operators you don't know."

>> * Consider doing your DNS as close as possible to the robot machines.
>
> Oh, yes. Absolutely. If you're being friendly to servers, you're
> rotating around hosts

I disagree; see below. But we agree on DNS.

>> * Consider not using LWP to fetch pages.
>
> And use ?? instead?

I ended up using a stub C program, because LWP's timeout mechanism is
really very shaky. What you really need for a robot is a call that says
"get a maximum of X bytes from this URL, take a maximum of Y seconds to
do it, and come back when you're finished and tell me what happened." Up
until early 2000 (the last time I worked on this), LWP couldn't be made
to do it.

>> * Consider crawling all the pages from a server, rather than going to
>> a random server for each new request.
>
> You mean don't rotate around?

Yeah, it's counter-intuitive, but it works really well.
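[As a concrete illustration of the "maximum of X bytes, maximum of Y seconds" call described above: here's a minimal sketch in Python (one of the two languages in this thread). The function name fetch_limited and the chunk size are my own inventions, not anything from LWP or a real library API.]

```python
import socket
import time
import urllib.request

def fetch_limited(url, max_bytes=100_000, max_seconds=10.0):
    """Fetch at most max_bytes from url, giving up after roughly
    max_seconds. Returns (data, note) where note says how it ended."""
    deadline = time.monotonic() + max_seconds
    data = b""
    try:
        # The timeout bounds each individual socket operation; the
        # deadline check below bounds the total wall-clock time.
        resp = urllib.request.urlopen(url, timeout=max_seconds)
        while len(data) < max_bytes:
            if time.monotonic() >= deadline:
                return data, "timeout"
            chunk = resp.read(min(4096, max_bytes - len(data)))
            if not chunk:
                return data, "complete"
            data += chunk
        return data, "truncated"
    except (OSError, socket.timeout) as exc:
        return data, "error: %s" % exc
```

[Whatever happens - slow server, huge page, connection reset - the caller gets back what was fetched and a note saying why the fetch stopped, which is exactly the "come back and tell me what happened" contract.]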
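[The single-thread-per-site courtesy policy discussed here - wait a randomized 3-30 seconds between fetches, and back off when the server's responses slow down - might be sketched like this in Python. The class name PoliteDelay, the smoothing factor, and the 5x back-off multiplier are illustrative assumptions, not anything Tim specifies.]

```python
import random

class PoliteDelay:
    """Per-site politeness timer: one instance per crawl thread."""
    MIN_WAIT = 3.0
    MAX_WAIT = 30.0

    def __init__(self):
        self.avg_response = None  # smoothed server response time, seconds

    def record(self, response_seconds):
        # Exponentially weighted moving average of observed response times.
        if self.avg_response is None:
            self.avg_response = response_seconds
        else:
            self.avg_response = 0.8 * self.avg_response + 0.2 * response_seconds

    def next_wait(self):
        # Normally wait a random 3-30 seconds; if the server is slowing
        # down, stretch the wait to several times its response time.
        base = random.uniform(self.MIN_WAIT, self.MAX_WAIT)
        if self.avg_response is None:
            return base
        return max(base, 5.0 * self.avg_response)
```

[The crawl thread calls record() after each fetch and sleeps for next_wait() before the next one; a healthy server gets the normal 3-30 second cadence, while a struggling one automatically gets left alone longer.]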
Let's assume your crawlees don't mind you crawling them (if they do,
consider not writing the robot). In general I find that if you're
courteous, i.e. hit them with a single thread and wait a few seconds
between each fetch, site operators don't mind being crawled in one big
gulp and getting it over with. And if the site is large-scale at all, how
much extra overhead can one robot thread really represent?

And if you do this, you can get a whole bunch of optimizations and
simplifications - you just keep spawning off threads that suck from a
single site for a fixed period of time or until they've got it all. And
the database structure you need to back your robot can be made a lot
simpler; think about it. The increase in throughput is really remarkable.
The only cost is that you need to go from high parallelism to REALLY VERY
HIGH parallelism, but that's OK; just throw some more RAM at the problem.

>> * Be really careful to avoid ever hitting any server with a lot of
>> requests faster than it can deliver them.
>
> Interesting. How do you do this? Track response times and slow down
> your requests to match or exceed an average? Did you ever do this for
> multi-threaded robots?

Yep; it's *really* easy if you're pointing a single thread at a single
site as I suggest; you just wait an amount of time varying between, say,
3 and 30 seconds between each page fetch, and back off if the server
response slows down.

>> * Be sure that you never order URLs for processing by lexical order.
>
> Because?

Because you end up starting with all the ftp:// URLs, then all the
http://123.123.123.132// ones, then http://aaa.aabacus.com/, and whatever
criterion is used to rank importance, this will pretty clearly not do
anything sensible.

>> * Read and use MIME headers, but verify them.
>
> Is there a Perl module for this?

Anyone here know?

-Tim

--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]).
For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
