Well, just very roughly: 4 billion pages x 20K per page / 1,000K per meg / 1,000 megs per gig = 80,000 gigs of data transfer every month.
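
If you want to redo that with your own page-size estimate, here it is as a quick Python snippet (the 20K average page size is just a rough assumption, not a measurement):

  # Monthly transfer for a full-web crawl, refreshed once a month.
  pages = 4_000_000_000          # ~4 billion pages
  kb_per_page = 20               # assumed average page size
  gigs_per_month = pages * kb_per_page / 1000 / 1000   # KB -> MB -> GB
  print(gigs_per_month)          # 80000.0 gigs, i.e. ~80 TB per month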
A 100 Mbps connection / 8 megabits per megabyte * 60 seconds in a minute * 60 minutes in an hour * 24 hours in a day * 30 days in a month = 32,400 gigs per month. So you'd need about three full 100 Mbps connections running at 100% capacity, 24/7, which, as you noted, is a huge undertaking. (The same arithmetic is repeated as a small script at the bottom of this message.)

As a second indicator of the scale, IIRC Doug Cutting posted a while ago that he downloaded and indexed 50 million pages in a day or two with about 10 servers. We download about 100,000 pages per hour on a dedicated 10 Mbps connection. Nutch will definitely fill more than a 10 Mbps connection, though; I scaled the system back to only use 10 Mbps before I went broke :). Hopefully those three points will give you an indication of the scale.

Of course, you still have the problem of storing 80,000 gigs of data - and the even bigger problem of doing something with it. You'll have to investigate further, but the next hurdle is that the data isn't likely to be easily accessible in the format you want. What you may consider doing is renting someone else's data. I think Alexa or Ask (one of those two) has made their entire database available for a fee based on CPU cycles or something, though the data was three months old. Or you could try a smaller search engine like Gigablast that might share their data with you for a fee.

-g.

It's possible Nutch 0.8 will do this, since it's set up for distributed computing.

Chris wrote:
> This is a big-picture question on what kind of money and effort it
> would require to do a full web crawl. By "full web crawl" I mean
> fetching the top four billion or so pages and keeping them reasonably
> fresh, with most pages no more than a month out of date.
>
> I know this is a huge undertaking. I just want to get ballpark numbers
> on the required number of servers and required bandwidth.
>
> Also, is it even possible to do with Nutch? How much custom coding
> would be required? Are there other crawlers that may be appropriate,
> like Heritrix?
>
> We're looking into doing a giant text mining app. We'd like to have a
> large database of web pages available for analysis. All we need to do
> is fetch and store the pages. We're not talking about running a search
> engine on top of it.
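
P.S. Here's the connection math from above as a small Python sketch, in case anyone wants to plug in different link speeds (it assumes 100% utilization around the clock, which no real link sustains):

  # Monthly capacity of one fully saturated link, and how many such
  # links the 80,000 gigs/month figure above would require.
  link_mbps = 100                          # link speed in megabits/sec
  mb_per_sec = link_mbps / 8               # 8 megabits per megabyte
  gigs_per_month = mb_per_sec * 60 * 60 * 24 * 30 / 1000
  print(gigs_per_month)                    # 32400.0 gigs per month
  print(80_000 / gigs_per_month)           # ~2.47, so about 3 links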
