Insurance Squared Inc. wrote:
Well, just very roughly:
4 billion pages x 20 KB per page / 1,000 KB per MB / 1,000 MB per GB =
80,000 GB of data transfer every month.
100 Mbps connection / 8 megabits per megabyte * 60 seconds in a minute *
60 minutes in an hour * 24 hours in a day * 30 days in a month / 1,000 MB
per GB = 32,400 GB per month.
So you'd need about 3 full 100 Mbps connections running at 100%
capacity, 24/7. Which, as you noted, is a huge undertaking.
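The arithmetic above checks out; here is a rough sketch in Java for
anyone who wants to verify it, assuming the same 20 KB average page size
and the decimal (1,000-based) unit conversions used above:

    public class BandwidthEstimate {
        public static void main(String[] args) {
            // 4 billion pages at an assumed average of 20 KB per page.
            double pages = 4e9;
            double kbPerPage = 20;
            double gbPerMonth = pages * kbPerPage / 1000 / 1000;  // KB -> MB -> GB
            System.out.println("Transfer needed: " + gbPerMonth + " GB/month");  // 80,000

            // What one 100 Mbps link moves in a 30-day month at 100% utilisation.
            double mbPerSecond = 100.0 / 8;                             // 12.5 MB/s
            double gbPerLink = mbPerSecond * 60 * 60 * 24 * 30 / 1000;  // ~32,400 GB
            System.out.println("One saturated link: " + gbPerLink + " GB/month");

            System.out.println("Links needed: " + gbPerMonth / gbPerLink);  // ~2.5
        }
    }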
True, but I doubt you need to download every page continuously. For that,
Nutch needs to send a conditional request using the stored Last-Modified
value (an If-Modified-Since header) and skip the download when it gets a
304 Not Modified response. I saw some patches around, but I don't think
they are in trunk yet.
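For illustration, a conditional fetch along those lines could look
roughly like this in plain Java (a sketch only, not the actual Nutch
patch; the URL and the last-fetch time are made up):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetch {
        public static void main(String[] args) throws Exception {
            // Hypothetical page and a made-up "last crawled" timestamp.
            URL url = new URL("http://example.com/page.html");
            long lastFetchTime = System.currentTimeMillis() - 24L * 60 * 60 * 1000;

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Conditional GET: the server compares this timestamp against the
            // page's Last-Modified date.
            conn.setIfModifiedSince(lastFetchTime);

            int code = conn.getResponseCode();
            if (code == HttpURLConnection.HTTP_NOT_MODIFIED) {
                // 304: nothing changed, skip the download and keep the stored copy.
                System.out.println("Not modified since last fetch, skipping");
            } else {
                // 200 (or other): download the body and record the new
                // Last-Modified value for the next cycle.
                System.out.println("Fetched, Last-Modified = " + conn.getLastModified());
            }
            conn.disconnect();
        }
    }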
As a second indicator of the scale, IIRC Doug Cutting posted a while
ago that he downloaded and indexed 50 million pages in a day or two
with about 10 servers.
We download about 100,000 pages per hour on a dedicated 10 Mbps
connection. Nutch will definitely fill more than a 10 Mbps connection,
though; I scaled the system back to use only 10 Mbps before I went broke
:).
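As a rough sanity check (again assuming the 20 KB average page size from
the estimate above), 100,000 pages per hour only needs about 4.4 Mbps, so
a saturated 10 Mbps link would top out somewhere around 225,000 pages per
hour at that page size:

    public class CrawlRateCheck {
        public static void main(String[] args) {
            double pagesPerHour = 100000;
            double kbPerPage = 20;                                 // assumed average
            double kbPerSecond = pagesPerHour * kbPerPage / 3600;  // ~556 KB/s
            double mbps = kbPerSecond * 8 / 1000;                  // ~4.4 Mbps
            System.out.println(pagesPerHour + " pages/hour ~= " + mbps + " Mbps");
        }
    }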
Could you please send your config info and describe the hardware you use
for crawling? We've managed only 10,000 pages per hour, sometimes less,
on a 100 Mbit/s connection.
--
Uros