Greetings! I'm totally loving our new Hadoop/Nutch cluster. It's so rewarding to work with :)
We're trying to figure out the best way to use the servers we've been allocated for our Nutch/Hadoop project. Currently we do two things:

1. Crawl and index pages using Nutch/Hadoop.
2. Parse them into a universal XML format using something we've written in C#.

I'm trying to figure out the best way to use about 10 machines. They're pretty good Dell server-class machines: dual processor, 4 GB of RAM, 500 GB hard drive. The Nutch crawls run pretty much 24/7, and we only parse about 1-5% of what we crawl.

Given that, should we:

1. Have all 10 machines crawling the web, rewrite our C# class as a map-reduce job (there's a rough sketch of what that might involve at the end of this message), and run it on the same machines at the same time? How much overhead does Nutch typically add? It doesn't seem like much to me...

-or-

2. Have 5 Linux machines crawling the web, copy the crawled pages to the other 5 machines, and have those machines parse them with the rewritten map-reduce C# code?

-or-

3. Have 5 Linux machines crawling the web, have 5 Windows servers request the crawled pages we need via a SOAP-type interface (I've written this), and parse them with our existing (non-map-reduce) framework?

We're looking for the best solution, so just consider all things equal (engineering time, etc.). I know this is pretty vague, so feel free to ask questions :P
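To give a sense of what the rewrite in options 1 and 2 would involve, here's a very rough sketch of the parse step as a map-only Hadoop job (old mapred API). The class name, the Text-based input, and the shouldParse()/parseToXml() placeholders are just stand-ins; the real job would read the content SequenceFiles out of Nutch's segments and port our C# parsing logic into the mapper:

// Sketch only: hypothetical map-only job that filters and parses crawled pages.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ParsePageJob {

  // Mapper: takes (url, raw page) pairs, keeps only the pages we care about,
  // and emits (url, universal XML).
  public static class ParsePageMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    public void map(Text url, Text rawPage,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      if (!shouldParse(url.toString())) {
        return;  // skip the ~95-99% of pages we never parse
      }
      String xml = parseToXml(rawPage.toString());  // port of the C# parser
      output.collect(url, new Text(xml));
    }

    private boolean shouldParse(String url) { /* selection logic here */ return true; }
    private String parseToXml(String page)  { /* parsing logic here */ return page; }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ParsePageJob.class);
    conf.setJobName("parse-to-xml");
    conf.setInputFormat(SequenceFileInputFormat.class);  // assumes (Text, Text) pairs
    conf.setMapperClass(ParsePageMapper.class);
    conf.setNumReduceTasks(0);  // map-only: no reduce step needed
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Since we only parse a small fraction of the crawl, a map-only job like this (no reducers) seems like it would add fairly little load on top of the crawling, but I'd love to hear whether that matches people's experience.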
