Re: [Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread Jaakko Hyvätti
On Tue, 4 Nov 2003, Alan Perkins wrote:
 Here's a question to test whether the list is alive and active...

  I have a feeling the bandwidth and other resources of web sites have
gone up so much that robots really do not pose a DoS threat any more. Hit
me as hard as you like, as long as I am in your index.  It is spam and
viruses that steal the attention and are orders of magnitude worse problems
for everybody.

  So, apparently, all the problems of robots have been solved, and the discussion
died away.  But there is no need to rush to close the list, in case something
new appears at some point.

Jaakko

-- 
Foreca Ltd   [EMAIL PROTECTED]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland http://www.foreca.com
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread Christian Storm
I thought I would post some of my experience with download rates.

We have built a large-scale crawler that has crawled over 2.4 billion URLs and continues 
to crawl at upwards of 500 pages/second.  In tuning the download policy we
found that both the hit rate and the number of pages downloaded per day come into play
when trying to tread lightly.  An easy but delayed measure of whether you are treading 
lightly is to monitor sites such as www.webmasterworld.com.  A more 
direct measure is the volume and types of complaints that come in over email.

From our experience, the bulk of the complaints come from the webmasters/businesses/etc. 
who purchased 1-5 GB of traffic per month but have a site
consisting of thousands, if not tens of thousands, of pages.  We were quick to find out
that there are *many* of these folks out on the Internet.  The problem is obvious.  If
the crawler downloads the whole site in one shot (even with a 30-second delay), the 
aggregate bandwidth usage sometimes puts that entity over their allotted limit, causing
their ISP to charge them extra.  Guess who's to blame in that circumstance?  Although we 
have always adhered to a 30-second policy, which I believe is very conservative in 2004,
we still receive the you-are-hitting-our-site-too-hard type of complaints.  Usually these
arise when we touch too many 404s and the webmaster has decided to have their web
server email them every time one is encountered.
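
One hedge against exactly this situation (purely a sketch with illustrative names and
numbers, not our production code) is a per-host byte budget, so the crawler backs off
from a small site long before its monthly traffic allowance is at risk:

# Hypothetical sketch: cap the bytes downloaded from any one host per month,
# so a site on a 1-5 GB/month hosting plan is never pushed over its quota.
import time
from collections import defaultdict

MONTHLY_BYTE_BUDGET = 500 * 1024 * 1024      # illustrative cap: 500 MB per host
SECONDS_PER_MONTH = 30 * 24 * 3600

class HostByteBudget:
    def __init__(self, budget=MONTHLY_BYTE_BUDGET):
        self.budget = budget
        self.window_start = defaultdict(time.time)   # start of each host's window
        self.bytes_used = defaultdict(int)

    def allow(self, host):
        """True if this host is still under its monthly byte budget."""
        if time.time() - self.window_start[host] > SECONDS_PER_MONTH:
            self.window_start[host] = time.time()    # roll over to a new window
            self.bytes_used[host] = 0
        return self.bytes_used[host] < self.budget

    def record(self, host, nbytes):
        """Account for a completed download from this host."""
        self.bytes_used[host] += nbytes

A crawl loop would check allow(host) before each fetch and call record(host, len(body))
afterwards; the right budget figure depends on how small a hosting plan you want to
protect.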

Just thought I'd pass along some information from the trenches.

--
Christian Storm, Ph.D.
www.turnitin.com
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread thomas.kay
Hello Robots list

Well, maybe this list can finally put to rest much of the 30-second wait 
issue.

Can we collectively research an adaptive routine?

We all need a common code routine that all our spidering modules and connective 
programs can use.  

Especially when we wish to get as close to the Ethernet optimum (or about 80% of true 
max, I believe) without getting ourselves into the DoS zone (above 80% of Ethernet max), 
where signal collisions will start causing failures, and the repeated and competing 
signals will effectively collapse the Ethernet communications medium.  

Can we not, therefore, settle the issue of finding the balancing point that determines 
optimum throughput from networks and servers at any given time?   

Can we not determine the optimum mathematical formula, then program it into our 
code libraries, so our spiders can all follow it?

So, in this effort: has anyone found, started to build, or can anyone recommend the 
building blocks of such an adaptive routine?

Can this list supply us all with THE de facto real-time adaptive throttling routine?  

A routine that tracks and adapts to the ever-changing conditions by taking real-time 
network measurements and feeding them through the formula; the result is the optimum 
wait time before connecting to the same server again.  The wait timer resets after 
each ACK packet from the target server. 

Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially set 
through user input, as some users will need to max out their dedicated network 
communication lines (such as adapter-card-to-adapter-card isolation work on very 
controlled networks).  Suggest an input of 0 for that kind of work.  The default setting 
of 1 will result in the optimal time determined by the formula.  Any other integer would 
simply multiply the time delay between server connections.  In this way the user could 
throttle the spider down to the needs of the local network and servers.  
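
As a strawman for discussion rather than a finished formula, here is a rough sketch of
such a routine.  The smoothing constants and the base factor are assumptions; the user
multiplier behaves as described above (0 disables the delay, 1 uses the computed delay,
any larger integer stretches it):

# Sketch of an adaptive per-server politeness delay (illustrative only).
# delay = base_factor * smoothed response time, scaled by a user multiplier.
import time

class AdaptiveThrottle:
    def __init__(self, multiplier=1, base_factor=10.0, min_delay=1.0, max_delay=60.0):
        self.multiplier = multiplier         # 0 = no delay, 1 = default, N = N x delay
        self.base_factor = base_factor       # delay is base_factor x response time
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.smoothed_rtt = None             # exponentially smoothed response time
        self.next_ok = 0.0                   # earliest time this server may be hit again

    def record_response(self, response_seconds):
        """Feed in the measured response time after each completed request."""
        if self.smoothed_rtt is None:
            self.smoothed_rtt = response_seconds
        else:
            self.smoothed_rtt = 0.8 * self.smoothed_rtt + 0.2 * response_seconds
        if self.multiplier == 0:
            delay = 0.0
        else:
            delay = self.base_factor * self.smoothed_rtt * self.multiplier
            delay = max(self.min_delay, min(self.max_delay, delay))
        self.next_ok = time.time() + delay   # the wait timer resets after each response

    def wait(self):
        """Block until it is polite to contact this server again."""
        pause = self.next_ok - time.time()
        if pause > 0:
            time.sleep(pause)

The choice of input (response time here) and the shape of the formula are exactly the
open questions.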

-Thomas Kay



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing: is this mailing list alive?


Alan Perkins writes:
  What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time
taken for the last response.  E.g. if a site responds in 20 ms,
you can hit it again the same second.  If a site takes 4 seconds
to respond, leave it at least 30 seconds before trying again.
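
As a rough illustration of that rule of thumb (the factor, floor and ceiling below are
assumptions, not list policy):

def polite_wait(last_response_seconds, factor=8.0, floor=0.5, ceiling=60.0):
    """Seconds to wait before hitting the same site again: several times the
    last response time, clamped to sensible bounds."""
    return min(ceiling, max(floor, factor * last_response_seconds))

With these numbers, a 20 ms response gives a sub-second wait and a 4-second response
gives just over 30 seconds, matching the examples above.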

  B) The number of robots you are running (e.g. 30 seconds per site per
  robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots.  If you use a Mercator-style
distribution strategy, this is a non-issue.
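
For reference, the idea of a Mercator-style scheme is that every host hashes to exactly
one fetch queue, so only one robot thread ever talks to a given server and per-host
politeness needs no cross-robot coordination.  A minimal sketch (the queue count and
hash choice are assumptions, not Mercator's actual parameters):

# Map each host to a single fetch queue so per-host delays are enforced
# by exactly one worker.  Illustrative only.
import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 64   # assumed number of worker threads / fetch queues

def queue_for_url(url, num_queues=NUM_QUEUES):
    """Return the index of the one queue responsible for this URL's host."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_queues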

  D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

  E) None of the above (i.e. anything goes)
  
  It's clear from the log files I study that some of the big players are
  not sticking to 30 seconds.  There are good reasons for this and I
  consider it a good thing (in moderation).  E.g. retrieving one page from
  a site every 30 seconds only allows 2880 pages per day to be retrieved
  from a site and this has obvious freshness implications when indexing
  large sites.

Many large sites are split across several servers. Often these can be
hit in parallel - if your robot is clever enough.

Richard
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread Andrew Daviel
On Tue, 4 Nov 2003 [EMAIL PROTECTED] wrote:

 Hello Robots list
 
 Well, maybe this list can finally put to rest much of the 30-second wait 
 issue.
 
 Can we collectively research an adaptive routine?

Interesting topic...

With one hat on, I operate one of those little servers with thousands of 
pages.  I guess I'm lucky; I don't pay for bandwidth, and the connection is 
naturally limited to a T-1.

With my other hat on, at TRIUMF we have started to have issues with 
bandwidth management.  We now have a gigabit link to the research networks 
with no byte charges, so we don't care if someone sucks our site from ESnet
(CERN, Fermilab, Los Alamos, etc.).
However, we have a 100 Mbit link to the commercial backbone and can't afford to 
fill it; P2P is a problem.  Our current solution is to limit outgoing
traffic to 1 Mbit, except for our central web server and mail server.
So we would be financially embarrassed if a lot of robots on the 
commercial side all decided to mirror our servers.

I guess what I'm trying to say is that the issue of instantaneous 
hit rate is not really a problem any more, but that volume might be.
However, I guess that the people running robots also have finite storage 
and have to pay for bandwidth, so that perhaps this is a non-problem 
except where there is a serious asymmetry between source and destination.


-- 
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376
[EMAIL PROTECTED]


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing: is this mailing list alive?

2003-11-04 Thread Walter Underwood
--On Tuesday, November 4, 2003 10:05 AM + Alan Perkins [EMAIL PROTECTED] wrote:
 
 What's the current accepted practice for hit rate? 

Ultraseek makes one request at a time to a server, with no
extra pause in between.  Each file is parsed before sending
the next request, so there is a bit of slack.  The spider
requests 25 URLs from a server, then moves on.

This usually works out to one or two requests per second
on a server.  If there are network delays or large documents,
it will slow down a lot.  For if-modified-since requests that
get a Not Modified response, it can go much faster.

The aggregate spidering rate is higher, because there can
be many spider threads making requests.
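
As an illustration only (this is a sketch, not Verity code), a single spider thread
following that per-server policy might look roughly like this, with the batch size of
25 taken from the description above and everything else assumed:

# One server at a time: fetch up to 25 URLs strictly sequentially, parsing
# each page before issuing the next request, then move on.  Illustrative only.
import urllib.request

BATCH_SIZE = 25   # URLs fetched from one server before moving on

def crawl_server(url_batch, parse):
    """Fetch up to BATCH_SIZE URLs from one server, one request at a time."""
    for url in url_batch[:BATCH_SIZE]:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
        except OSError:
            continue                  # skip failures, stay polite
        parse(body)                   # parsing before the next request adds slack
        # no extra sleep: slow servers and large documents pace the loop naturally

Running many such threads, each on a different server, gives the higher aggregate rate
described above.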

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots