Scott Simpson wrote:
Excuse my ignorance on this issue. Say I have 5 machines in my Hadoop
cluster and I only list two of them in the configuration file when I do a
"fetch" or a "generate". Won't this just store the data on those two nodes,
since they are all I've listed as my crawling machines? I'm trying to crawl
on two but store my data across all five.

So you want to use different sets of machines for dfs than for MapReduce? An easy way to achieve this is to install Hadoop separately and start only dfs there ('bin/hadoop-daemon.sh start namenode; bin/hadoop-daemons.sh start datanode', or use the new bin/start-dfs.sh script).

Then, in your Nutch installation, start only the MapReduce daemons, using a different conf/slaves file ('bin/hadoop-daemon.sh start jobtracker; bin/hadoop-daemons.sh start tasktracker', or use the new bin/start-mapred.sh script).

Just make sure that your Nutch installation is configured to talk to the same namenode as your Hadoop installation, and that you don't run bin/start-all.sh from either installation. Does that make sense?
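A minimal sketch of that layout, assuming five machines named node1..node5, a standalone Hadoop installation in ~/hadoop, a Nutch installation in ~/nutch, and that both read the namenode address from the fs.default.name property in conf/hadoop-site.xml (the hostnames, paths, and port are hypothetical):

    # Hadoop installation: list all five machines in conf/slaves,
    # so datanodes run everywhere, then start only the dfs daemons:
    #   ~/hadoop/conf/slaves -> node1 node2 node3 node4 node5
    ~/hadoop/bin/start-dfs.sh

    # Nutch installation: list only the two crawl machines in its own
    # conf/slaves, then start only the MapReduce daemons:
    #   ~/nutch/conf/slaves -> node1 node2
    ~/nutch/bin/start-mapred.sh

    # Both installations' conf/hadoop-site.xml must name the same namenode:
    #   <property>
    #     <name>fs.default.name</name>
    #     <value>node1:9000</value>
    #   </property>

With this setup the datanodes on all five machines store blocks, while tasktrackers (and thus fetch/generate tasks) run only on the two crawl machines.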

Doug
