> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Tim Bray

[snip]

>     * Write it in Perl (or equivalent).

I suppose it doesn't help with a book on Perl, but I'm rewriting my robots
in Python and I'm very happy with the way it's going.  Performance is
better, but that may be largely because I know a lot more about the problems
I'm solving than I did when I wrote the Perl.  It would be interesting to
come up with some informal criteria for when to use which languages.  For
example, I really like the Python sgmllib SGMLParser module for pulling data
out of tables.
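(sgmllib was Python's stdlib SGML parser in that era; it's gone from Python 3,
but the same cell-scraping idea works with the stdlib html.parser, which is
what this sketch uses.  The sample table is made up.)

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            # accumulate text between the cell's open and close tags
            self.rows[-1][-1] += data

p = TableExtractor()
p.feed("<table><tr><th>Host</th><th>Pages</th></tr>"
       "<tr><td>example.com</td><td>42</td></tr></table>")
# p.rows == [['Host', 'Pages'], ['example.com', '42']]
```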

>     * Consider achieving high parallelism by flinging money at
> computers & RAM.
>     * Learn the Robot Exclusion Protocol and take it seriously.
> For example, watch out for redirects.
>     * Consider just staying away from xxx.lanl.gov and
> www.imc.com; it's not worth the grief.

LANL was one of the very early beta sites, perhaps the first, for Verity's
spider.  Perhaps that early experience led them to see such beasts as more
of a problem than a solution.  I recall that we discovered some infinite
recursion there.
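On taking the Robot Exclusion Protocol seriously: Python now ships a parser
for it in the stdlib, so there's little excuse not to check.  A sketch, with
a made-up robots.txt (normally you'd set_url() and read() the live file, and
Tim's caveat about redirects applies to fetching robots.txt itself):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse a made-up robots.txt directly instead of fetching one.
rp.parse("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
""".splitlines())

rp.can_fetch("NickBot", "http://example.com/private/report.html")  # False
rp.can_fetch("NickBot", "http://example.com/papers/index.html")    # True
```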

>     * If you're doing a really big robot, include a real human's email
> address in the HTTP request headers, and be responsive.

When you say "big," Tim, do you mean in terms of breadth, or what?

>     * Consider doing your DNS as close as possible to the robot machines.

Oh, yes.  Absolutely.  If you're being friendly to servers, you're rotating
around hosts so that you don't hit any one of them for long, which means
you're doing a lot of name resolution.  I've never considered caching in the
robot itself, but perhaps...
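Caching in the robot itself might look something like this: memoize lookups
for a TTL, with the resolver injectable (all names here are mine, and real
DNS TTLs from the records themselves would be better than a fixed one):

```python
import socket
import time

class DNSCache:
    """Memoize hostname -> address lookups for `ttl` seconds."""
    def __init__(self, ttl=300.0, resolve=socket.gethostbyname):
        self.ttl = ttl
        self.resolve = resolve   # injectable, so tests need no network
        self._cache = {}         # host -> (address, expiry)

    def lookup(self, host):
        entry = self._cache.get(host)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]              # fresh cache hit
        addr = self.resolve(host)        # only hit the resolver on a miss
        self._cache[host] = (addr, now + self.ttl)
        return addr
```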

>     * Assume that robot processes are going to freeze, lock up, go into
> loops, etc, from time to time, and build auto-recovery before you need it.
>     * Consider not using LWP to fetch pages.

And use ?? instead?
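On the auto-recovery point: one cheap piece of it is checkpointing the URL
frontier atomically, so a wedged or killed robot can be restarted where it
left off.  A sketch (the file layout and names are mine):

```python
import json
import os

def save_frontier(path, pending, done):
    """Write crawl state atomically: a crash mid-write leaves the old file intact."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"pending": list(pending), "done": sorted(done)}, f)
    os.replace(tmp, path)   # atomic rename, so readers never see a half-written file

def load_frontier(path):
    """Return (pending, done); empty state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return [], set()
    with open(path) as f:
        state = json.load(f)
    return state["pending"], set(state["done"])
```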

>     * Try really hard to use the LWP REP implementation, because it's good.
>     * Consider crawling all the pages from a server, rather than going to a
> random server for each new request.

You mean don't rotate around?

>     * Be really careful to avoid ever hitting any server with a lot of
> requests faster than it can deliver them.

Interesting.  How do you do this?  Track response times and slow down your
requests to match or exceed an average?  Did you ever do this for
multi-threaded robots?

>     * Be sure that you never order URLs for processing by lexical order.

Because?
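My guess at the reason (Tim can correct me): URLs sorted lexically cluster
by scheme and host, so you end up hitting one server with a long consecutive
run of requests.  Round-robining across hosts avoids that; a sketch:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def interleave_by_host(urls):
    """Round-robin across hosts so no server sees a long consecutive run."""
    by_host = defaultdict(deque)
    for u in urls:
        by_host[urlparse(u).hostname].append(u)
    out = []
    queues = deque(by_host.values())   # one FIFO of pending URLs per host
    while queues:
        q = queues.popleft()
        out.append(q.popleft())        # take one URL from this host...
        if q:
            queues.append(q)           # ...then move to the back of the line
    return out

interleave_by_host(["http://a.com/1", "http://a.com/2", "http://b.com/1"])
# -> ['http://a.com/1', 'http://b.com/1', 'http://a.com/2']
```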

>     * Read and use MIME headers, but verify them.

Is there a Perl module for this?
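(In the LWP world the headers come back on the HTTP::Response object, if
memory serves.)  The "verify them" half could be a cheap cross-check of the
declared Content-Type against the body's magic number.  A sketch; the
signature table here is deliberately tiny, real sniffing needs a longer one:

```python
# A few well-known magic numbers; extend as needed.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def sniffed_type(body):
    """Guess a MIME type from the leading bytes, or None if nothing matches."""
    for magic, mime in SIGNATURES.items():
        if body.startswith(magic):
            return mime
    return None

def header_lies(declared, body):
    """True when the body's magic number contradicts the declared Content-Type."""
    guess = sniffed_type(body)
    # Ignore any "; charset=..." parameters on the declared type.
    return guess is not None and guess != declared.split(";")[0].strip()
```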

>     * Ensure your management understands that your robot cannot detect
> "sites" or "home pages".
>     * Ensure your management understands that your robot cannot detect
> "good" pages.

Mine can, sort of!  Of course, like you, I'm my own management.  ;-)

Interesting stuff, Tim.

I don't want to talk too much in public about what I'm working on, but for
the last few years, my focus has been discovery of reputation and
influence...

Nick


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
