Insurance Squared Inc. wrote:
> Well, just very roughly:
> 4 billion pages x 20K per page / 1000K per meg / 1000 megs per gig =
> 80,000 gigs of data transfer every month.
>
> 100mbs connection / 8 bits per byte * 60 seconds in a minute *
> 60 minutes in an hour * 24 hours in a day * 30 days in a month =
> 32,400 gigs per month.
> So you'd need about 3 full 100mbs connections running at 100%
> capacity, 24/7. Which as you noted is a huge undertaking.

True, but I doubt you need to re-download every page continuously.
Nutch should send an If-Modified-Since request header (using the
Last-Modified value from the previous fetch) and skip the download
whenever it gets a 304 Not Modified response -- there's a rough sketch
of the idea below. I saw some patches for this floating around, but I
don't think they are in trunk yet.

> As a second indicator of the scale, IIRC Doug Cutting posted a while
> ago that he downloaded and indexed 50 million pages in a day or two
> with about 10 servers.
> We download about 100,000 pages per hour on a dedicated 10mbs
> connection. Nutch will definitely fill more than a 10mbs connection
> though, I scaled the system back to only use 10mbs before I went
> broke :).

Could you please send your config info and the hardware you use for
crawling? We've managed only 10,000 pages per hour, sometimes less, on
100 Mbit/s; for comparison, the fetcher settings we've been tuning are
sketched at the end of this mail.
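For what it's worth, a quick sanity check of the back-of-the-envelope
numbers quoted above (decimal units throughout, 1000K per meg as in the
original estimate):

// Sanity check of the bandwidth math quoted above (decimal units).
public class BandwidthEstimate {
    public static void main(String[] args) {
        // 4 billion pages x 20K per page, converted KB -> MB -> GB
        double gigsNeeded = 4_000_000_000L * 20.0 / 1000 / 1000;
        // One 100mbs link saturated 24/7: 12.5 MB/s for 30 days, in GB
        double gigsPerLink = 100.0 / 8 * 60 * 60 * 24 * 30 / 1000;
        System.out.println(gigsNeeded + " gigs needed per month");    // 80000.0
        System.out.println(gigsPerLink + " gigs per link per month"); // 32400.0
        System.out.println(Math.ceil(gigsNeeded / gigsPerLink) + " links"); // 3.0
    }
}

which agrees with the figures quoted: about 80,000 gigs against 32,400
gigs per saturated link, so three full connections.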
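The conditional-fetch idea itself is just standard HTTP, nothing
Nutch-specific. A minimal sketch of what the fetcher would do -- plain
java.net here, not the actual patch, whose code I haven't seen:

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of a conditional fetch: send If-Modified-Since with
// the timestamp of the previous successful fetch, skip the body on 304.
public class ConditionalFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical: pretend we last fetched this page a day ago.
        long lastFetch = System.currentTimeMillis() - 24L * 60 * 60 * 1000;

        URL url = new URL("http://example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(lastFetch); // sets the If-Modified-Since header

        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: page unchanged, nothing but headers crossed the wire
            System.out.println("not modified, keep cached copy");
        } else {
            // 200 (or anything else): re-download and re-parse as usual
            System.out.println("changed, fetching " + conn.getContentLength() + " bytes");
        }
        conn.disconnect();
    }
}

Of course this only saves bandwidth on servers that actually honor
If-Modified-Since, but on a recrawl of mostly-static pages that should
be the bulk of the traffic.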
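And since I'm asking for your numbers: I can't claim our settings are
any good (we're the ones stuck at 10,000 pages per hour), but the knobs
we've been tuning are the standard fetcher properties in
conf/nutch-site.xml. The values below are illustrative, not a
recommendation:

<configuration>
  <!-- total number of fetcher threads across all hosts -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <!-- max concurrent requests against any single host -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>
  <!-- politeness delay, in seconds, between requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
</configuration>

If a crawl is dominated by a few large hosts, the per-host limit and
the server delay will cap throughput long before the connection does.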
-- 
Uros
