Scott Simpson wrote:
Excuse my ignorance on this issue. Say I have 5 machines in my Hadoop cluster and I only list two of them in the configuration file when I do a "fetch" or a "generate". Won't this just store the data on the two nodes since that is all I've listed for my crawling machines? I'm trying to crawl on two but store my data across all five.
So you want to use different sets of machines for dfs than for MapReduce? An easy way to achieve this is to install Hadoop separately and start dfs only there ('bin/hadoop-daemon.sh start namenode; bin/hadoop-daemons.sh start datanode', or use the new bin/start-dfs.sh script).

Then, in your Nutch installation, start only the MapReduce daemons, using a different conf/slaves file ('bin/hadoop-daemon.sh start jobtracker; bin/hadoop-daemons.sh start tasktracker', or use the new bin/start-mapred.sh script).

Just make sure that your Nutch installation is configured to talk to the same namenode as your Hadoop installation, and make sure that you don't run bin/start-all.sh from either installation. Does that make sense?
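To make that concrete, here is a rough sketch of the two startup sequences. The install paths (/opt/hadoop, /opt/nutch) and host/port names are placeholders I'm inventing for illustration; the daemon scripts themselves are the ones named above.

```shell
# --- On the standalone Hadoop installation (DFS only) ---
# Its conf/slaves file lists all 5 machines, so datanodes start everywhere.
cd /opt/hadoop            # placeholder path
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemons.sh start datanode
# or, equivalently, the newer convenience script:
# bin/start-dfs.sh

# --- On the Nutch installation (MapReduce only) ---
# Its conf/slaves file lists only the 2 crawling machines,
# so tasktrackers start on just those two.
cd /opt/nutch             # placeholder path
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemons.sh start tasktracker
# or, equivalently:
# bin/start-mapred.sh
```

And "configured to talk to the same namenode" means the Nutch side's Hadoop config should point fs.default.name at the namenode started by the other installation, something like (host and port are again placeholders):

```
<property>
  <name>fs.default.name</name>
  <value>namenode-host:9000</value>
</property>
```

Crucially, neither installation should run bin/start-all.sh, since that script starts both the dfs and the MapReduce daemons from one slaves file, which is exactly what you're trying to avoid.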
Doug