Hello Robots list,

Maybe this list can finally put to rest a great deal of the "30-second wait" issue.
Can we all collaborate on an adaptive routine? We need a common code routine that all our spidering modules and connecting programs can use, especially when we wish to get as close to the Ethernet optimum (about 80% of the true maximum, I believe) without getting into the DoS zone (above 80% of the Ethernet maximum), where signal collisions start causing failures, and the retransmissions and competing signals effectively collapse the Ethernet communications medium.

Can we not, therefore, settle the issue of finding the balancing point that determines optimum throughput from networks and servers at any given time? Can we not determine the optimum mathematical formula, then program it into our code libraries so our spiders can all follow it? To that end: has anyone found, started to build, or can anyone recommend the building blocks of such an adaptive routine? Can this list supply us all with THE de facto real-time adaptive throttling routine? A routine that tracks and adapts to ever-changing conditions by taking real-time network measurements, feeding them through the formula, and producing the optimum wait time before connecting to the same server again. The wait time would reset after each ACK packet from the target server. Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially set through user input, since some users will need to max out their dedicated network lines (such as adapter-card-to-adapter-card work on very controlled, isolated networks). I suggest an input of "0" for that work. The default setting, "1", would result in the optimal time determined by the formula. Any other integer would simply multiply the time delay between server connections. In this way the user could throttle the routine down to the needs of the local network and servers.
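A minimal sketch of what such a routine might look like, assuming a simple formula (wait time proportional to the last measured response time) with the multiplier semantics described above. The class name, the BASE_FACTOR value, and the formula itself are illustrative assumptions, not an agreed standard:

```python
import time

# Assumed heuristic: wait this many times the last response time.
# This stands in for whatever formula the list eventually agrees on.
BASE_FACTOR = 5.0

class AdaptiveThrottle:
    """Per-server throttle; the timer resets after each response (the
    'ACK packet' in the proposal above)."""

    def __init__(self, multiplier=1):
        # Multiplier semantics from the proposal:
        #   0 -> no delay (dedicated/isolated network work)
        #   1 -> the optimal time determined by the formula (default)
        #   n -> n times the formula's delay
        self.multiplier = multiplier
        self.next_allowed = 0.0

    def wait_before_request(self):
        # Sleep until the computed wait time has elapsed.
        remaining = self.next_allowed - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)

    def record_response(self, response_seconds):
        # Feed the real-time measurement through the formula and
        # reset the wait timer for this server.
        delay = BASE_FACTOR * response_seconds * self.multiplier
        self.next_allowed = time.monotonic() + delay
```

A spider would call wait_before_request() before each fetch from a given server and record_response() with the measured response time afterwards.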
-Thomas Kay

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing, is this mailing list alive?

Alan Perkins writes:
> What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time taken for the last response. E.g. if a site responds in 20 ms, you can hit it again the same second. If a site takes 4 seconds to respond, leave it at least 30 seconds before trying again.

> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style distribution strategy, this is a non-issue.

> D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

> E) None of the above (i.e. anything goes)
>
> It's clear from the log files I study that some of the big players are
> not sticking to 30 seconds. There are good reasons for this and I
> consider it a good thing (in moderation). E.g. retrieving one page from
> a site every 30 seconds only allows 2880 pages per day to be retrieved
> from a site and this has obvious "freshness" implications when indexing
> large sites.

Many large sites are split across several servers. Often these can be hit in parallel - if your robot is clever enough.

Richard
_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
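Richard's rule of thumb (an interval several times the last response time, with a sensible floor) can be sketched as follows. The factor and floor values are assumptions chosen to roughly match his examples (a 20 ms response allows another hit within the same second; a 4-second response implies about a 30-second wait):

```python
def politeness_delay(response_seconds, factor=7.5, floor=0.5):
    """Return the seconds to wait before hitting the same site again.

    The delay grows with server response time, so slow (loaded) servers
    get long pauses; the floor keeps fast servers from being hammered.
    The factor and floor values are illustrative assumptions.
    """
    return max(factor * response_seconds, floor)
```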