This is a big question...

What kind of resources are required for doing a crawl of the whole web? I'm
just looking for ballpark numbers -- servers, bandwidth, cost, etc.

Assumptions:

"Whole web" means roughly as many pages as a second- or third-tier search
engine crawls (which is what we're thinking about building). I'm not sure
how many pages that is. 10 billion, maybe?

Timeframe: the crawl should take about as long as the minor search engines
take. Maybe a month or two? Fast-changing sites would be refreshed more
frequently, static sites less so.

Cost: we could get a rough idea if we knew the number of servers, the amount
of disk per server, and the required bandwidth. From those, it's not too
tough to price the data center cabinets needed to house it all.
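
To make the arithmetic concrete, here is a rough sketch of how those inputs
could be turned into ballpark figures. The 10 billion pages and the one to
two month window are my guesses from above. The average page size, the
per-server fetch rate, and the disk per server are placeholder values I made
up for illustration, not measurements, so please correct them if you have
real numbers.

```python
import math


def crawl_estimate(pages, days, avg_page_kb,
                   pages_per_server_per_sec, disk_tb_per_server):
    """Back-of-envelope sizing for one full crawl pass.

    All inputs are assumptions supplied by the caller; nothing here is
    measured. Units are decimal (1 KB = 1000 bytes, 1 TB = 1e12 bytes).
    """
    seconds = days * 24 * 3600
    pages_per_sec = pages / seconds

    # Sustained download bandwidth in megabits per second.
    bandwidth_mbps = pages_per_sec * avg_page_kb * 1000 * 8 / 1e6

    # Raw storage for fetched content only (no index, no replication).
    storage_tb = pages * avg_page_kb * 1000 / 1e12

    # Servers needed to keep up with fetching vs. to hold the data.
    fetch_servers = math.ceil(pages_per_sec / pages_per_server_per_sec)
    storage_servers = math.ceil(storage_tb / disk_tb_per_server)

    return {
        "pages_per_sec": pages_per_sec,
        "bandwidth_mbps": bandwidth_mbps,
        "storage_tb": storage_tb,
        "servers": max(fetch_servers, storage_servers),
    }


if __name__ == "__main__":
    # Hypothetical inputs: 10 KB per page, 100 pages/sec per server,
    # 2 TB of disk per server. Only the page count and the 30/60 day
    # window come from the assumptions in this post.
    for days in (30, 60):
        est = crawl_estimate(10_000_000_000, days, 10, 100, 2)
        print(days, "days:", est)
```

This only covers fetching and raw storage. Indexing, replication, re-crawls
of fast-changing sites, and politeness delays would all push the numbers up,
which is why I'd still like to hear from people who have actually done it.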

Another big cost would be the engineers to build and maintain it. Perhaps
two or three people, full time, supplemented by 24x7 data center support?

I know I'm leaving out a lot of variables, but I'm really just looking for
order-of-magnitude numbers. Replies from people who have actually done it,
with their actual experiences, would be greatly appreciated.


-- 
View this message in context: 
http://www.nabble.com/Resources-required-for-whole-web-crawl--tp16373189p16373189.html
Sent from the Nutch - User mailing list archive at Nabble.com.
