> This is a big picture question on what kind of money and effort it would
> require to do a full web crawl. By "full web crawl" I mean fetching the
> top four billion or so pages and keeping them reasonably fresh, with
> most pages no more than a month out of date.
>
> I know this is a huge undertaking. I just want to get ballpark numbers
> on the required number of servers and required bandwidth.
>
> Also, is it even possible to do with Nutch? How much custom coding would
> be required? Are there other crawlers that may be appropriate, like
> Heritrix?
>
> We're looking into doing a giant text mining app. We'd like to have a
> large database of web pages available for analysis. All we need to do is
> fetch and store the pages. We're not talking about running a search
> engine on top of it.
>
The last count I saw put Google at 200,000+ servers. That should give you
an indication of the magnitude of crawling the whole web.
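
For a rough sense of scale, here is a back-of-envelope calculation. The
25 KB average page size and the uniform 30-day refresh window are
assumptions, not measurements, so treat the outputs as order-of-magnitude
only:

```python
# Back-of-envelope crawl sizing. All inputs are assumptions.
PAGES = 4_000_000_000             # target: ~4 billion pages
REFRESH_SECONDS = 30 * 24 * 3600  # re-fetch every page within ~30 days
AVG_PAGE_BYTES = 25_000           # assumed average HTML page size (~25 KB)

pages_per_sec = PAGES / REFRESH_SECONDS
bandwidth_mbps = pages_per_sec * AVG_PAGE_BYTES * 8 / 1e6
storage_tb = PAGES * AVG_PAGE_BYTES / 1e12

print(f"{pages_per_sec:,.0f} pages/sec sustained")
print(f"{bandwidth_mbps:,.0f} Mbit/s sustained fetch bandwidth")
print(f"{storage_tb:,.0f} TB of raw HTML per snapshot")
```

That works out to roughly 1,500 pages/sec and ~300 Mbit/s sustained, with
~100 TB of raw HTML per snapshot, before politeness delays, retries,
duplicates, and storage overhead, which push the real requirements
considerably higher.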


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
