As a second indicator of the scale, IIRC Doug Cutting posted a while
ago that he downloaded and indexed 50 million pages in a day or two
with about 10 servers.
We download about 100,000 pages per hour on a dedicated 10 Mbit/s
connection. Nutch will definitely fill more than a 10 Mbit/s connection,
though; I scaled the system back to only use 10 Mbit/s before I went
broke :).
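As a rough sanity check on those numbers (assuming an average fetched
page of around 45 KB, which is just a guess on my part):

    10 Mbit/s ~ 1.25 MB/s ~ 4.5 GB/hour
    4.5 GB/hour at ~45 KB/page ~ 100,000 pages/hour

So a saturated 10 Mbit/s link works out to roughly 100,000 pages/hour
at that page size.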
Could you please send your config info and the hardware you use for
crawling? We've managed only 10,000 pages per hour, sometimes less, on
a 100 Mbit/s connection.
--
For that I'm using a Dell 1750 with dual Xeons and 8 GB of RAM, though
I can get the same rate with only a single P4 processor. You've likely
got one of two issues. First, you may not actually have a 100 Mbit/s
connection; somewhere there's a bottleneck. Second, watch the limit on
the size of the files you crawl. I think we limit our file size to
64K. If that limit is too big, you end up spending all day downloading
10 MB PDFs, and that'll really slow things down.
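For reference, if I remember right that's the http.content.limit
property (the Nutch default is 65536 bytes, i.e. 64K); you can
override it in conf/nutch-site.xml with something like:

    <property>
      <name>http.content.limit</name>
      <!-- cap fetched content at 64K; anything larger gets truncated -->
      <value>65536</value>
    </property>

A negative value disables truncation entirely, which is exactly what
you don't want here.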