Greetings,

I'm totally loving our new Hadoop/Nutch cluster. It's so rewarding to
work with :)

We're trying to figure out the best way to utilize the servers we have
been allocated for our Nutch/Hadoop project. Currently, we do two
things:
1. Crawl and index pages using Nutch/Hadoop, and then
2. Parse them into a universal XML format using something we've written in C#.

Specifically, I'm trying to figure out how best to use about 10 machines.
They're decent Dell server-class boxes: dual processor, 4 GB of RAM,
500 GB hard drive. The Nutch crawls run pretty much 24/7.

We only parse about 1-5% of what we crawl.

Given that information, should we:

1. Have all 10 machines crawling the web, rewrite our C# class to run as
a map-reduce job (there's a rough sketch of what I mean after this list),
and run it on those same machines at the same time? How much of each
machine's resources does Nutch typically eat while crawling? It doesn't
seem like much to me...
-or-
2. Have 5 Linux machines crawling the web, copy the crawled pages to 5
other machines, and parse them there with the rewritten map-reduce C#
code?
-or-
3. Have 5 Linux machines crawling the web, have 5 Windows servers
request the crawled pages we need via a SOAP-style interface (I've
already written this), and then parse them using our existing
(non-map-reduce) framework?

We're looking for the best overall solution, so just treat everything
else (engineering time, etc.) as equal.

I know this is pretty vague, so feel free to ask questions :P
