10 billion pages = 2000 search servers + ~500-1000 processing machines +
a 100Mbps line. Ballpark 3-4 million for the servers, plus 30-50K+ a
month for bandwidth and electricity at the datacenters. I am assuming 5M
pages per search server.
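To make the arithmetic explicit, here is a rough Python sketch using only the figures in this thread. The function names are made up for illustration. The bandwidth part assumes the line is fully saturated for the whole crawl, that 1 Mbps means 1,000,000 bits per second, and that each page is fetched exactly once; the 30 and 60 day windows come from the "month or two" in the question.

```python
# Back-of-envelope sizing from the numbers quoted in this thread.
TOTAL_PAGES = 10_000_000_000          # 10 billion pages
PAGES_PER_SEARCH_SERVER = 5_000_000   # stated assumption: 5M pages per server
LINE_MBPS = 100                       # 100Mbps line


def search_servers_needed(total_pages, pages_per_server):
    """Servers required to hold the index (ceiling division)."""
    return -(-total_pages // pages_per_server)


def bytes_per_page_budget(line_mbps, crawl_days, total_pages):
    """Average bytes available per page if the line runs flat out.

    Assumes 1 Mbps = 1,000,000 bits/s and no protocol overhead.
    """
    bytes_per_second = line_mbps * 1_000_000 / 8
    total_bytes = bytes_per_second * crawl_days * 86_400
    return total_bytes / total_pages


servers = search_servers_needed(TOTAL_PAGES, PAGES_PER_SEARCH_SERVER)
budget_30 = bytes_per_page_budget(LINE_MBPS, 30, TOTAL_PAGES)
budget_60 = bytes_per_page_budget(LINE_MBPS, 60, TOTAL_PAGES)

print(f"search servers: {servers}")
print(f"bytes/page over 30 days: {budget_30:.0f}")
print(f"bytes/page over 60 days: {budget_60:.0f}")
```

Under those assumptions the server count comes out at 2000, and the line allows roughly 3.2KB per page over 30 days or 6.5KB over 60 days, so the bandwidth figure is worth checking against your own average page size and crawl cycle.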
Search servers would need little disk space but a lot of RAM, say 8G+ each.
Processing machines would need 500G+ disks, more likely 2-3 of the newer 1T
disks, plus 8G ECC memory and multi-core (probably quad-core) processors.
At a previous company we built a 100M-page system for 35K in hardware and
2,500 a month in hosting charges. Second-tier search engines now have
somewhere around 4B pages.
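As a sanity check, the 100M-page figures can be scaled up to 10 billion pages. The sketch below assumes cost grows strictly linearly with page count, which is a simplification and not a claim about how real costs behave; `scale_linearly` is a hypothetical helper.

```python
# Naive linear extrapolation from the 100M-page system to 10 billion pages.
SMALL_SYSTEM_PAGES = 100_000_000
SMALL_SYSTEM_HARDWARE = 35_000        # one-time hardware cost
SMALL_SYSTEM_HOSTING = 2_500          # hosting charges per month
TARGET_PAGES = 10_000_000_000


def scale_linearly(value, from_pages, to_pages):
    """Scale a cost in proportion to page count (assumed linear)."""
    return value * to_pages / from_pages


hardware = scale_linearly(SMALL_SYSTEM_HARDWARE, SMALL_SYSTEM_PAGES, TARGET_PAGES)
hosting = scale_linearly(SMALL_SYSTEM_HOSTING, SMALL_SYSTEM_PAGES, TARGET_PAGES)

print(f"hardware at 10B pages: {hardware:,.0f}")
print(f"hosting per month at 10B pages: {hosting:,.0f}")
```

Hardware scales to 3.5 million, which sits inside the 3-4 million ballpark. Hosting scales to 250K a month, well above the 30-50K+ estimate, so the monthly costs presumably do not grow linearly or the two figures cover different things.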
Hope this helps.
Dennis
Shef wrote:
This is a big question...
What kind of resources are required for doing a crawl of the whole web? I'm
just looking for ballpark numbers -- servers, bandwidth, cost, etc.
Assumptions:
"Whole web" means roughly the same number of pages crawled by second or
third-tier search engines (which is what we're thinking about building). I'm
not sure how many pages that is. 10 billion, maybe?
Timeframe: the crawl should take the same amount of time that the minor
search engines take. Maybe a month or two? Fast-changing sites refreshed
more frequently, static sites less so.
Cost: we could get a rough idea if we knew the number of servers, amount of
disk per server, and the required bandwidth. It's not too tough to find the
cost of renting the cabinets in a data center to do this.
Another big cost would be the engineers to build and maintain it. Perhaps
two or three people, full time, supplemented by 24x7 data center support?
I know I'm leaving out a lot of variables, but I'm really just looking for
order-of-magnitude numbers. Replies from people who have actually done it,
with their actual experiences, would be greatly appreciated.