> This is a big-picture question on what kind of money and effort it would
> require to do a full web crawl. By "full web crawl" I mean fetching the
> top four billion or so pages and keeping them reasonably fresh, with
> most pages no more than a month out of date.
>
> I know this is a huge undertaking. I just want to get ballpark numbers
> on the required number of servers and required bandwidth.
>
> Also, is it even possible to do with Nutch? How much custom coding would
> be required? Are there other crawlers that may be appropriate, like
> Heritrix?
>
> We're looking into doing a giant text mining app. We'd like to have a
> large database of web pages available for analysis. All we need to do is
> fetch and store the pages. We're not talking about running a search
> engine on top of it.

I believe the most recent count of Google's servers is 200,000+. That
should give you an indication of the magnitude of crawling the whole web.
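
For the ballpark numbers you asked about, a back-of-envelope calculation
is more useful than the raw server count. Here is a minimal sketch in
Java; the average page size and per-server fetch rate are assumptions
I've picked for illustration, not measurements, so plug in your own:

// Back-of-envelope crawl sizing. All inputs are assumptions:
// tune pages, avgPageBytes, and perServerFetchRate to your own
// estimates before trusting the output.
public class CrawlEstimate {
    public static void main(String[] args) {
        double pages = 4e9;                 // target corpus size
        double avgPageBytes = 10 * 1024;    // assumed average page: 10 KB
        double refreshSeconds = 30 * 86400; // one-month refresh window
        double perServerFetchRate = 50;     // assumed pages/sec per fetch box

        double totalBytes = pages * avgPageBytes;
        double pagesPerSec = pages / refreshSeconds;
        double bandwidthMbps = totalBytes / refreshSeconds * 8 / 1e6;
        double servers = pagesPerSec / perServerFetchRate;

        System.out.printf("Raw storage:     %.1f TB%n", totalBytes / 1e12);
        System.out.printf("Sustained fetch: %.0f pages/sec%n", pagesPerSec);
        System.out.printf("Bandwidth:       %.0f Mbit/s sustained%n",
                          bandwidthMbps);
        System.out.printf("Fetch servers:   ~%.0f (at %.0f pages/sec each)%n",
                          servers, perServerFetchRate);
    }
}

With those assumptions you land around 40 TB of raw storage (less after
compression), roughly 1,500 fetches/sec sustained, on the order of
125 Mbit/s of bandwidth, and a few dozen fetch machines -- before
accounting for politeness delays, DNS, storage, and coordination
overhead, which is where most of the real cost hides.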
