Hello Robots list

Well, maybe this list can finally put to rest a great deal of the "30 second wait" 
issue.

Can we all collectively research an adaptive routine?

We all need a common code routine that all our spidering modules and connective 
programs can use.  

Especially when we wish to get as close to the Ethernet optimum (about 80% of true 
max, I believe) without getting ourselves into the DoS zone ( >80% of Ethernet max ), 
where signal collisions will start causing failures, and the repeated and competing 
signals will effectively collapse the Ethernet communications medium.

Can we not, therefore, settle the issue of finding the balancing point in determining 
optimum throughput from networks and servers at any given time?   

Can we not determine the optimum mathematical formula, then program it into our 
libraries of code, so our spiders can all follow it?

So in this effort: has anyone found, started to build, or can recommend the building 
blocks of, such an adaptive routine?

Can this list supply us all with THE de facto real-time adaptive throttling routine?

A routine that will track and adapt to the ever-changing conditions by taking real-
time network measurements, feeding them through the formula, and producing the optimum 
wait time before connecting to the same server again.  The wait timer resets after 
each ACK packet from the target server.
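A skeleton of such a routine might look like the following.  The formula itself is 
exactly what I am asking the list to supply; the placeholder used here (wait = factor 
x last response time, with a floor) and the factor/floor values are only assumptions 
to give the discussion a concrete shape:

```python
import time

class AdaptiveThrottle:
    """Sketch of the adaptive routine described above.

    The real formula is the open question; a simple placeholder
    (wait = factor * last response time, with a floor) stands in
    for it here.  The timer resets after each successful response,
    standing in for the ACK from the target server.
    """

    def __init__(self, factor=5.0, floor=0.5):
        self.factor = factor        # placeholder: multiple of last response time
        self.floor = floor          # placeholder: minimum wait in seconds
        self.last_response = None   # last measured response time (seconds)
        self.last_done = 0.0        # when the last response ("ACK") arrived

    def record(self, response_seconds):
        # Real-time network measurement fed into the formula;
        # receiving the response resets the wait timer.
        self.last_response = response_seconds
        self.last_done = time.monotonic()

    def wait_time(self):
        # Result of the (placeholder) formula: how much longer to wait
        # before connecting to the same server again.
        if self.last_response is None:
            return 0.0
        delay = max(self.floor, self.factor * self.last_response)
        elapsed = time.monotonic() - self.last_done
        return max(0.0, delay - elapsed)
```

The spider would call record() after every completed request and sleep for 
wait_time() before the next request to the same server.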

Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially 
set through user input, as some users will need to max out their dedicated network 
communication lines (such as adapter card to adapter card, isolation work on very 
controlled networks). I suggest a "0" input for that work.  The default setting, "1", 
will result in the optimal time determined by the formula.  Any other integer would 
simply multiply the time delay between server connections.  In this way the user could 
throttle it down to the needs of the local network and servers.
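That user setting is simple enough to pin down now, whatever the formula turns out to 
be (the function name here is just illustrative):

```python
def throttled_delay(formula_delay, user_setting=1):
    """Apply the user-configurable multiplier described above.

    0 -> no delay at all (dedicated, isolated network work)
    1 -> the optimal time determined by the formula (default)
    N -> N times the formula's delay, throttling further down
    """
    if user_setting == 0:
        return 0.0
    return user_setting * formula_delay
```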

-Thomas Kay



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing is this mailing list alive?


Alan Perkins writes:
 > What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time
taken for the last response. e.g. if a site responds in 20 ms,
you can hit it again the same second. If a site takes 4 seconds
to respond, leave it at least 30 seconds before trying again.
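That rule of thumb, sketched as code; a multiple of 8 is one arbitrary choice that 
happens to match both examples:

```python
def polite_delay(last_response_seconds, multiple=8):
    # Interval several times longer than the last response time.
    # multiple=8 matches both examples above: a 20 ms response gives
    # 0.16 s (hit again the same second), a 4 s response gives 32 s
    # (at least 30 s before trying again).
    return multiple * last_response_seconds
```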

 > B) The number of robots you are running (e.g. 30 seconds per site per
 > robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style
distribution strategy, this is a non-issue.
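(In a Mercator-style setup, each host is assigned to exactly one robot, so per-host 
delays never need coordinating across robots. One way to do the assignment, with 
crc32 as an arbitrary stable hash:

```python
from zlib import crc32

def robot_for_host(host, n_robots):
    # Every request for a given host goes through the same robot,
    # so that robot alone enforces the per-host delay.
    return crc32(host.encode("utf-8")) % n_robots
```

)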

 > D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

 > E) None of the above (i.e. anything goes)
 > 
 > It's clear from the log files I study that some of the big players are
 > not sticking to 30 seconds.  There are good reasons for this and I
 > consider it a good thing (in moderation).  E.g. retrieving one page from
 > a site every 30 seconds only allows 2880 pages per day to be retrieved
 > from a site and this has obvious "freshness" implications when indexing
 > large sites.

Many large sites are split across several servers. Often these can be
hit in parallel - if your robot is clever enough.

Richard
_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
