Hi everyone, I've got a few questions on Nutch. We're using the latest release version, 0.9.
We're looking to crawl, index, and search roughly 20,000 websites. Based on the example figure of roughly 10 KB per page and a conservative estimate of 100 pages per site (since the sites will be commercial in nature, my guess is that most will have more than that), that equates to about 20 GB of storage required. Does anyone have stats on roughly how much bandwidth is needed to crawl this number of sites once? We've tried to start out small, but our dedicated server host is already complaining about outbound bandwidth and breaches of their terms of use. Are there any recommended hosts for crawling with Nutch, or can anyone suggest hosting particulars to look out for?

Lastly, we'd like to connect to our Nutch install via an API so we can add more content to our results. Our main application is a standard Java web application with Spring, Hibernate, MySQL, etc. sitting pretty on Tomcat. We also (currently) run Nutch on the same Tomcat install. What's the best way to communicate with the Nutch install from our current application to give it search engine capabilities? Does Nutch include web services, or can we use RMI?

Thanks for any feedback; it would be greatly appreciated.

Cheers,
Aled
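P.S. In case it helps anyone sanity-check the figures, here's the back-of-envelope arithmetic behind the 20 GB estimate. The per-page size and pages-per-site counts are my assumptions, not measurements, and this only counts raw fetched content (no HTTP overhead, no index or crawldb storage):

```java
public class CrawlEstimate {
    // Assumptions from the post: ~10 KB per fetched page, ~100 pages per site.
    static final long SITES = 20000;
    static final long PAGES_PER_SITE = 100;
    static final long KB_PER_PAGE = 10;

    // Raw fetched content in gigabytes, using 1 GB = 1,000,000 KB
    // to match the rough arithmetic above.
    static long storageGb() {
        return SITES * PAGES_PER_SITE * KB_PER_PAGE / 1000000L;
    }

    public static void main(String[] args) {
        System.out.println("Pages to fetch: " + (SITES * PAGES_PER_SITE));
        System.out.println("Raw content: ~" + storageGb() + " GB");
    }
}
```

A single full crawl would transfer at least that much data over the network, so whatever the exact overhead, a host with a low monthly transfer cap clearly won't do.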
