Greetings,

I'm using Hadoop for more than just Nutch, so I decided to separate the two, following the instructions I found here: http://www.mail-archive.com/[email protected]/msg10225.html
It seems to be mostly working. I'm running Nutch 0.9 on Hadoop 0.16.1, with one namenode and four slaves, all on Ubuntu Server. Storing on the DFS seems to work fine. However, the crawl appears to run only from the namenode, which is where I kick off the Nutch job using "hadoop/bin/nutch crawl" (I moved the nutch script into the hadoop bin directory and pointed the paths in the script to the correct locations).

A netstat on the namenode shows it connecting to WWW servers, but a netstat on a slave node shows connections only to the namenode.

What am I missing here? I appreciate your help in advance! :)

Cheers,
Bradford

Note: I've included hadoop-site.xml and nutch-site.xml below.

~~~~~~~~~~~~~~
hadoop-site.xml:

<configuration>
  <property>
    <name>mapred.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/visibleuser/search/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://dttest01:54310</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>dttest01:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

~~~~~~~~~~~~~~~~~
nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>AgentName</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>agentdesc</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>www.useragent.com</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[EMAIL PROTECTED]</value>
  </property>
</configuration>
