Insurance Squared Inc. wrote:
> Well, just very roughly:
> 4 billion pages x 20 KB per page / 1000 KB per MB / 1000 MB per GB =
> 80,000 GB of data transfer every month.
>
> 100 Mbit/s connection / 8 bits per byte = 12.5 MB/s * 60 seconds in a
> minute * 60 minutes in an hour * 24 hours in a day * 30 days in a
> month = 32,400 GB per month.
> So you'd need about 3 full 100 Mbit/s connections running at 100%
> capacity, 24/7.  Which as you noted is a huge undertaking.
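
(For reference, here is the arithmetic above as a quick runnable sketch.
The 4 billion pages, 20 KB/page and 100 Mbit/s figures are the
assumptions from the quote; the class name is just for illustration.)

    // Back-of-the-envelope bandwidth estimate from the numbers above.
    public class BandwidthEstimate {
        public static void main(String[] args) {
            double totalGb = 4e9 * 20 / 1000 / 1000;   // pages * KB -> MB -> GB = 80,000 GB
            double mbPerSec = 100 / 8.0;               // 100 Mbit/s = 12.5 MB/s
            double gbPerMonth = mbPerSec * 60 * 60 * 24 * 30 / 1000;  // = 32,400 GB
            System.out.printf("Crawl volume:         %.0f GB/month%n", totalGb);
            System.out.printf("One 100 Mbit/s link:  %.0f GB/month%n", gbPerMonth);
            System.out.printf("Links needed at 100%%: %.1f%n", totalGb / gbPerMonth); // ~2.5
        }
    }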
True, but I doubt you need to re-download every page continuously. For 
that to work, Nutch needs to send the If-Modified-Since header and skip 
the fetch when it gets a 304 Not Modified response. I saw some patches 
for this floating around, but I don't think they are in trunk yet.
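
(A minimal sketch of that conditional-GET idea, using plain
java.net.HttpURLConnection rather than Nutch's actual fetcher code; the
URL and the stored timestamp are made-up examples.)

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalFetch {
        public static void main(String[] args) throws Exception {
            // Timestamp of our last successful fetch, e.g. loaded from the crawl db.
            long lastFetched = System.currentTimeMillis() - 86_400_000L; // one day ago
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://example.com/").openConnection();
            conn.setIfModifiedSince(lastFetched);  // sends the If-Modified-Since header
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                System.out.println("304 Not Modified -- keep cached copy, no page transfer");
            } else {
                System.out.println("Page changed -- re-fetch body and store new timestamp");
            }
            conn.disconnect();
        }
    }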
> As a second indicator of the scale, IIRC Doug Cutting posted a while 
> ago that he downloaded and indexed 50 million pages in a day or two 
> with about 10 servers.
> We download about 100,000 pages per hour on a dedicated 10 Mbit/s
> connection.  Nutch will definitely fill more than a 10 Mbit/s
> connection though; I scaled the system back to only use 10 Mbit/s
> before I went broke :).
>
Could you please send your config info and describe the hardware you use 
for crawling? We've managed only about 10,000 pages per hour, sometimes 
less, on a 100 Mbit/s connection.

--
Uros

