> This is a big picture question on what kind of money and effort it would
> require to do a full web crawl. By "full web crawl" I mean fetching the
> top four billion or so pages and keeping them reasonably fresh, with
> most pages no more than a month out of date.
>
> I know this is a huge undertaking. I just want to get ballpark numbers
> on the required number of servers and required bandwidth.
>
> Also, is it even possible to do with Nutch? How much custom coding would
> be required? Are there other crawlers that may be appropriate, like
> Heritrix?
>
> We're looking into doing a giant text mining app. We'd like to have a
> large database of web pages available for analysis. All we need to do is
> fetch and store the pages. We're not talking about running a search
> engine on top of it.
>
I believe the last published count put Google at 200,000+ servers.
That should give you an indication of the magnitude of crawling the
whole web.
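
For rough numbers, a back-of-the-envelope calculation may help. The
figures below are assumptions, not measurements: the page count and
one-month refresh window come from your question, while the average
page size (~25 KB) and per-server fetch rate (100 pages/sec) are
guesses you should replace with your own estimates.

# Back-of-the-envelope crawl sizing. All inputs are assumptions:
# page count and refresh window from the original question; average
# page size and per-server fetch rate are rough guesses.

PAGES = 4_000_000_000              # "top four billion or so pages"
REFRESH_DAYS = 30                  # "no more than a month out of date"
AVG_PAGE_BYTES = 25 * 1024         # assumed average HTML page size (~25 KB)
FETCHES_PER_SEC_PER_SERVER = 100   # assumed sustained rate per fetcher box

seconds = REFRESH_DAYS * 24 * 3600
fetch_rate = PAGES / seconds                        # pages/sec, whole cluster
bandwidth_bps = fetch_rate * AVG_PAGE_BYTES * 8     # sustained inbound bandwidth
servers = fetch_rate / FETCHES_PER_SEC_PER_SERVER   # fetcher machines needed
storage_tb = PAGES * AVG_PAGE_BYTES / 1024**4       # raw, uncompressed store

print(f"fetch rate : {fetch_rate:,.0f} pages/sec")
print(f"bandwidth  : {bandwidth_bps / 1e9:,.2f} Gbit/s sustained")
print(f"fetchers   : {servers:,.0f} servers")
print(f"storage    : {storage_tb:,.0f} TB raw HTML")

With those assumptions the raw arithmetic works out to roughly 1,500
pages/sec, a few hundred Mbit/s sustained, and on the order of 100 TB
of raw HTML. Treat those as lower bounds: per-host politeness delays,
DNS, retries, deduplication, and storage overhead tend to push the
real requirements well above the raw numbers.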