This is a big-picture question about the money and effort it would take 
to do a full web crawl. By "full web crawl" I mean fetching the top four 
billion or so pages and keeping them reasonably fresh, with most pages 
no more than a month out of date.

I know this is a huge undertaking; I just want ballpark numbers for the 
required number of servers and the required bandwidth.
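To make the question concrete, here is the back-of-envelope arithmetic 
I'm starting from, as a small Python sketch. The page count, refresh 
window, and average page size are all rough assumptions on my part, not 
measured numbers:

    # Back-of-envelope crawl sizing -- every number is a rough assumption.
    PAGES = 4_000_000_000     # assumed target: ~4 billion pages
    REFRESH_DAYS = 30         # every page refetched within a month
    AVG_PAGE_KB = 20          # assumed average HTML page size

    seconds = REFRESH_DAYS * 24 * 3600
    pages_per_sec = PAGES / seconds
    mbit_per_sec = pages_per_sec * AVG_PAGE_KB * 8 / 1000   # KB/s -> Mbit/s

    print(f"sustained fetch rate: {pages_per_sec:,.0f} pages/sec")
    print(f"sustained bandwidth:  {mbit_per_sec:,.0f} Mbit/s")

Under those assumptions it works out to roughly 1,500 pages/sec and 
about 250 Mbit/s sustained, before retries, redirects, DNS, or any 
other overhead. Happy to be told my assumptions are off.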

Also, is it even possible to do this with Nutch? How much custom coding 
would be required? Are there other crawlers that might be more 
appropriate, like Heritrix?

We're looking into building a giant text-mining app. We'd like to have 
a large database of web pages available for analysis. All we need to do 
is fetch and store the pages; we're not talking about running a search 
engine on top of it.
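For the storage side of fetch-and-store, here is a similarly rough 
sketch; again, the average page size and compression ratio are just my 
assumptions:

    # Rough storage estimate for fetch-and-store -- sizes are assumptions.
    PAGES = 4_000_000_000     # same assumed page count as above
    AVG_PAGE_KB = 20          # assumed average raw HTML size
    COMPRESSION = 0.3         # assume gzip shrinks HTML to ~30% of raw

    raw_tb = PAGES * AVG_PAGE_KB / 1e9     # KB -> TB (decimal)
    compressed_tb = raw_tb * COMPRESSION

    print(f"raw corpus:        ~{raw_tb:,.0f} TB")
    print(f"compressed corpus: ~{compressed_tb:,.0f} TB")

So on the order of 80 TB raw, or roughly 24 TB compressed, for a single 
snapshot of the corpus, if those assumptions are anywhere near right.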


