Difficulty w/ Distributed Crawl with Separate Nutch/Hadoop

Bradford Stephens Thu, 03 Apr 2008 10:43:30 -0700

Greetings,

I'm using Hadoop for more than just Nutch, so I decided to separate
the two, following the instructions I found here:
http://www.mail-archive.com/[email protected]/msg10225.html


It seems to be mostly working. I'm running Nutch 0.9 on Hadoop
0.16.1, with one namenode and four slaves, all on Ubuntu Server.
Storing on the DFS seems to work fine. However, the crawl appears to
run only from the namenode, which is where I kick off the Nutch job
using "hadoop/bin/nutch crawl" (I moved the nutch script to the
hadoop bin directory and pointed the paths in the script to the
correct locations). A netstat on the namenode shows connections to WWW
servers, but a netstat on a slave node shows a bunch of connections
to the namenode only.

What am I missing here?

I appreciate your help in advance! :)

Cheers,
Bradford

Note: I've included hadoop-site.xml and nutch-site.xml below.
~~~~~~~~~~~~~~
hadoop-site.xml:

<configuration>

<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>


<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/visibleuser/search/hadoop/tmp</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://dttest01:54310</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>dttest01:54311</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

</configuration>

~~~~~~~~~~~~~~~~~
nutch-site.xml:

<configuration>

<property>
  <name>http.agent.name</name>
  <value>AgentName</value>
</property>

<property>
  <name>http.agent.description</name>
  <value>agentdesc</value>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.useragent.com</value>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
</property>


</configuration>
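
One way to confirm which values a node's hadoop-site.xml actually sets is to parse it and look at mapred.job.tracker. The following is a minimal sketch, not part of the Nutch or Hadoop distributions: the function names and the SAMPLE string are hypothetical, and it assumes that an unset mapred.job.tracker falls back to "local" (the in-process runner), as in the hadoop-default.xml of this era. It only reads the file; it does not tell you which conf directory the nutch script picks up at runtime.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the hadoop-site.xml shown above.
SAMPLE = """<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://dttest01:54310</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>dttest01:54311</value>
</property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def runs_distributed(props):
    """True if mapred.job.tracker names a host:port rather than 'local'.

    Assumption: a missing property is treated as 'local'.
    """
    return props.get("mapred.job.tracker", "local") != "local"


props = read_properties(SAMPLE)
print("mapred.job.tracker =", props.get("mapred.job.tracker"))
print("distributed:", runs_distributed(props))
```

To use it against a real file, read the file's contents and pass the text to read_properties, then run the same check on the namenode and on each slave to see whether they agree.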

Difficulty w/ Distributed Crawl with Separate Nutch/Hadoop
