This is a big picture question on what kind of money and effort it would require to do a full web crawl. By "full web crawl" I mean fetching the top four billion or so pages and keeping them reasonably fresh, with most pages no more than a month out of date.
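For scale, here is a rough back-of-envelope calculation, assuming 4 billion pages, a strict 30-day refresh cycle, and an assumed average page size of 50 KB (the page-size figure is a guess, not a measurement):

```python
# Back-of-envelope crawl sizing. All inputs are assumptions:
#   - 4e9 pages, each refetched within 30 days
#   - 50 KB average fetched page size (varies widely in practice)

PAGES = 4_000_000_000
REFRESH_SECONDS = 30 * 24 * 3600   # 30-day refresh window in seconds
AVG_PAGE_BYTES = 50_000            # assumed average page size

fetch_rate = PAGES / REFRESH_SECONDS             # sustained pages per second
bandwidth_bps = fetch_rate * AVG_PAGE_BYTES * 8  # sustained bits per second

print(f"sustained fetch rate: {fetch_rate:,.0f} pages/s")
print(f"sustained bandwidth:  {bandwidth_bps / 1e6:,.0f} Mbit/s")
```

This works out to roughly 1,500 pages per second sustained, i.e. several hundred Mbit/s of inbound bandwidth before any overhead for retries, DNS, redirects, or re-crawling failures.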
I know this is a huge undertaking. I just want ballpark numbers for the required number of servers and bandwidth. Also, is this even feasible with Nutch? How much custom coding would be required? Are there other crawlers that might be appropriate, like Heritrix?

We're looking into building a giant text-mining app, and we'd like a large database of web pages available for analysis. All we need to do is fetch and store the pages; we're not talking about running a search engine on top of it.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
