Insurance Squared Inc. wrote:
Well, just very roughly:
4 billion pages x 20 KB per page / 1,000 KB per MB / 1,000 MB per GB = 80,000 GB of data transfer every month.

100 Mbps connection / 8 bits per byte * 60 seconds in a minute * 60 minutes in an hour * 24 hours in a day * 30 days in a month / 1,000 MB per GB = 32,400 GB per month. So you'd need about three full 100 Mbps connections (80,000 / 32,400 is roughly 2.5) running at 100% capacity, 24/7, which, as you noted, is a huge undertaking.
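For anyone who wants to sanity-check those numbers, here's a quick back-of-the-envelope script (plain Python, just plugging in the figures quoted above; the 20 KB/page average is the same assumption as in the quote):

# Back-of-the-envelope estimate for a 4-billion-page monthly crawl.
pages = 4_000_000_000                     # pages per month
kb_per_page = 20                          # assumed average page size in KB
gb_needed = pages * kb_per_page / 1000 / 1000          # KB -> MB -> GB
print(f"transfer needed: {gb_needed:,.0f} GB/month")   # ~80,000 GB

mbps = 100                                # link speed in megabits per second
mb_per_s = mbps / 8                       # 12.5 MB/s
seconds_per_month = 60 * 60 * 24 * 30
gb_per_link = mb_per_s * seconds_per_month / 1000        # MB -> GB
print(f"one 100 Mbps link: {gb_per_link:,.0f} GB/month") # ~32,400 GB

print(f"links needed at 100% utilisation: {gb_needed / gb_per_link:.1f}")  # ~2.5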
True, but I doubt you need to re-download every page continuously. For that to work, Nutch needs to send an If-Modified-Since header with the stored Last-Modified date and skip the fetch when it gets a 304 Not Modified response. I saw some patches for this floating around, but I think they are not in trunk yet.
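For what it's worth, the conditional-GET mechanics look roughly like this; this is only a standalone Python sketch of the HTTP side (the URL and date are placeholders), not the actual Nutch patch:

import urllib.request
import urllib.error

url = "http://example.com/some-page.html"      # placeholder URL
last_seen = "Sat, 29 Oct 2005 19:43:31 GMT"    # Last-Modified value stored from the previous fetch

req = urllib.request.Request(url)
req.add_header("If-Modified-Since", last_seen)  # ask the server to send the body only if it changed

try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()                      # 200: page changed, re-fetch and re-index it
        print("changed, fetched", len(body), "bytes")
except urllib.error.HTTPError as e:
    if e.code == 304:
        print("304 Not Modified - skip, keep the cached copy")
    else:
        raise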
As a second indicator of the scale, IIRC Doug Cutting posted a while ago that he downloaded and indexed 50 million pages in a day or two with about 10 servers. We download about 100,000 pages per hour on a dedicated 10 Mbps connection. Nutch will definitely fill more than a 10 Mbps connection, though; I scaled the system back to use only 10 Mbps before I went broke :).

Could you please send your config info and the hardware you use for crawling? We've managed only 10,000 pages per hour, sometimes less, on a 100 Mbit/s connection.

--
Uros
