Alan Perkins wrote:

> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Hit rate
> This directive could indicate to a robot how long to wait between
> requests to the server. Currently it is accepted practice to wait at
> least 30 seconds between requests, but this is too fast for some sites,
> too slow for others.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> What's the current accepted practice for hit rate?  Does it vary
> according to

With the availability of persistent connections, a robot that drops the
connection, or keeps it open for 30 seconds without requesting another
resource, does the server no good. Large sites generally have good
connectivity, so robots can request resources at a higher rate without
degrading performance, regardless of the response code. If a robot does find
that a site is responding slowly (in latency or throughput), it should
reduce the hit rate or even suspend crawling temporarily to avoid
overloading the server.
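
For illustration only, a minimal sketch of that kind of adaptive delay (the
base delay, ceiling and the 5 second "slow" / 1 second "fast" thresholds are
assumptions, not recommended values):

    BASE_DELAY = 1.0     # assumed minimum delay between requests (seconds)
    MAX_DELAY = 120.0    # assumed ceiling; suspend the crawl once reached

    def next_delay(elapsed, current_delay):
        """Adjust the inter-request delay from the last response time."""
        if elapsed > 5.0:
            # Site is responding slowly: back off.
            return min(current_delay * 2, MAX_DELAY)
        if elapsed < 1.0:
            # Site is fast: cautiously speed up again.
            return max(current_delay / 2, BASE_DELAY)
        return current_delay

A robot would time each request (e.g. with time.monotonic() around the
fetch), feed the elapsed time into next_delay(), sleep for the returned
value, and suspend the crawl temporarily once MAX_DELAY is hit.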

> A) The HTTP response (e.g. no need to wait 30 seconds after a 304)

I would recommend waiting only after the server has closed the connection
(i.e. has not maintained the persistent connection). As long as the
connection is open, sending another request is a better choice than waiting
and keeping the connection open but idle.
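
A sketch of that approach using Python's http.client (the host, paths and
the 30 second pause are placeholders; a real robot would also have to handle
the server dropping the connection without announcing it):

    import http.client
    import time

    HOST = "www.example.com"              # placeholder host
    PATHS = ["/", "/a.html", "/b.html"]   # placeholder resources
    CLOSED_DELAY = 30                     # assumed pause after a close (seconds)

    conn = http.client.HTTPConnection(HOST)
    for path in PATHS:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()   # drain the body so the connection can be reused
        if resp.getheader("Connection", "").lower() == "close":
            # The server ended the persistent connection: close, wait, reconnect.
            conn.close()
            time.sleep(CLOSED_DELAY)
            conn = http.client.HTTPConnection(HOST)
        # Otherwise the connection is still open and the next request
        # goes out immediately instead of sitting idle.
    conn.close()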

>
> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Running multiple robots in parallel increases the number of open connections
required at the server; a single persistent connection is more
server-friendly (and usually easier to manage, too: I have seen sites
crawled for the same resources by parallel robots which apparently do not
communicate status information in real time).
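
As an illustration only, one way to keep a single request stream per site
while still working on several sites at once; fetch() here is a hypothetical
function that requests a resource over that site's one persistent connection:

    import queue
    import threading

    host_queues = {}   # one queue, one worker, one connection per host

    def worker(host, q):
        while True:
            path = q.get()
            if path is None:
                break
            fetch(host, path)   # hypothetical: uses the host's single connection
            q.task_done()

    def enqueue(host, path):
        """Serialize all requests to the same host through one worker."""
        if host not in host_queues:
            host_queues[host] = queue.Queue()
            t = threading.Thread(target=worker,
                                 args=(host, host_queues[host]), daemon=True)
            t.start()
        host_queues[host].put(path)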

> C) The number of active robots on the Web (e.g. 1000 robots isn't many,
> 10 million robots is - and if too many unrelated robots hit a site,
> that's another effective DDOS attack)

The number of other robots hitting a site is not a known factor, although
performance metrics can give an indication of whether or not a site is under
heavy load.
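
One crude way to derive such an indication, as a sketch only (the window
size and threshold are assumptions):

    from collections import deque

    class LoadIndicator:
        """Rolling average of recent response times as a rough load signal."""
        def __init__(self, window=20, slow_threshold=5.0):
            self.samples = deque(maxlen=window)
            self.slow_threshold = slow_threshold   # assumed value, in seconds
        def record(self, elapsed):
            self.samples.append(elapsed)
        def looks_overloaded(self):
            if not self.samples:
                return False
            return sum(self.samples) / len(self.samples) > self.slow_threshold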

> D) Some other factor (e.g. server response time, etc.)

> E) None of the above (i.e. anything goes)

DOS monitors may raise alerts or block traffic if robots hit a site too
hard, too frequently, with too many parallel processes, and so on.
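
A robot can at least avoid making this worse by honouring an explicit
throttling signal. A sketch, using the standard Retry-After header on a 503
response (the 300 second default backoff is an assumption):

    import time
    import urllib.request
    import urllib.error

    def fetch_politely(url, default_backoff=300):
        """Fetch a URL, backing off when the server says it is unavailable."""
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 503:
                # Server is refusing traffic; honour Retry-After if present.
                retry_after = err.headers.get("Retry-After")
                delay = (int(retry_after)
                         if retry_after and retry_after.isdigit()
                         else default_backoff)
                time.sleep(delay)
            raise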


--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/


_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
