Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Added rsyncing code to slave nodes and removed unnecessary env variables

------------------------------------------------------------------------------
  the hadoop-env.sh file:
  
  {{{
- NUTCH_HOME=/nutch/search
- HADOOP_HOME=/nutch/search
+ export HADOOP_HOME=/nutch/search
- 
- JAVA_HOME=/usr/java/jdk1.5.0_06
+ export JAVA_HOME=/usr/java/jdk1.5.0_06
- NUTCH_JAVA_HOME=${JAVA_HOME}
- 
- NUTCH_LOG_DIR=${HADOOP_HOME}/logs
+ export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- 
- NUTCH_MASTER=devcluster01
- HADOOP_MASTER=devcluster01
- HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
+ export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
  }}}
  
  There are other variables in this file which will affect the behavior of 
Hadoop.  If you start getting ssh errors when you run the scripts later, try 
changing the HADOOP_SSH_OPTS variable.  Note also that, after the initial copy, 
you can set HADOOP_MASTER in your conf/hadoop-env.sh and it will use rsync to 
update the code running on each slave when you start daemons on that slave.
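  For example, if the ssh errors come from slow connections or host key 
prompts, options along these lines in conf/hadoop-env.sh may help (these 
particular flags are an illustration, not part of the tutorial's setup; see 
your ssh_config man page for what applies to your environment):
  
  {{{
  export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o StrictHostKeyChecking=no"
  }}}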
@@ -418, +411 @@

  bin/start-all.sh
  }}}
  
- A command like 'bin/slaves.sh uptime' is a good way to test that things are 
configured correctly before attempting to call the start-all.sh script.
+ '''A command like 'bin/slaves.sh uptime' is a good way to test that things 
are configured correctly before attempting to call the start-all.sh script.'''
  
  The first time all of the nodes are started you may see ssh prompts 
asking to add the hosts to the known_hosts file.  You will have to type in yes 
for each one and hit enter.  The output may look a little weird the first time, 
but just keep typing yes and hitting enter if the prompts keep appearing.  You 
should see output showing all the servers starting on the local machine and the 
task tracker and data node servers starting on the slave nodes.  Once this is 
complete we are ready to begin our crawl.
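  If you would rather avoid the repeated yes prompts, one option is to add each 
slave's host key to known_hosts ahead of time with ssh-keyscan.  This is a 
sketch using the example hostnames from this tutorial; substitute your own 
slave names:
  
  {{{
  # run as the nutch user on the master; hostnames are examples
  for host in devcluster02 devcluster03; do
    ssh-keyscan -t rsa $host >> ~/.ssh/known_hosts
  done
  }}}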
  
@@ -497, +490 @@

  Then point your browser to http://devcluster01:8080 (your master node) to see 
the Nutch search web application.  If everything has been configured correctly 
then you should be able to enter queries and get results.
  
  
+ == Rsyncing Code to Slaves ==
+ 
--------------------------------------------------------------------------------
+ Nutch and Hadoop provide the ability to rsync master changes to the slave 
nodes.  This is optional, though, because it slows down the startup of the 
servers and because you might not want to have changes automatically synced to 
the slave nodes.  
+ 
+ If you do want this capability enabled, below I will show you how to 
configure your servers to rsync from the master.  There are a couple of things 
you should know first.  One, even though the slave nodes can rsync from the 
master, you still have to copy the base installation over to each slave node 
the first time so that the scripts are available to do the rsync.  This is the 
way we did it above, so that shouldn't require any changes.  Two, the way the 
rsync happens is that the master node does an ssh into the slave node and calls 
bin/hadoop-daemon.sh.  The script on the slave node then calls the rsync back 
to the master node.  This means that you have to have a password-less login 
from each of the slave nodes to the master node.  Earlier we set up 
password-less login from the master to the slaves; now we need to do the 
reverse.  Three, if you have problems with the rsync options (I did, and I had 
to change the options because I am running an older version of ssh), look in 
the bin/hadoop-daemon.sh script around line 82 for where it calls the rsync 
command.  
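+ 
+ For reference, in the Hadoop version used here the relevant part of 
bin/hadoop-daemon.sh looks roughly like the following (check your own copy of 
the script, as the exact flags vary between releases):
+ 
+ {{{
+ if [ "$HADOOP_MASTER" != "" ]; then
+   echo rsync from $HADOOP_MASTER
+   rsync -a -e ssh --delete --exclude=.svn $HADOOP_MASTER/ "$HADOOP_HOME"
+ fi
+ }}}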
+ 
+ So the first thing we need to do is set up the HADOOP_MASTER variable in the 
conf/hadoop-env.sh file.  The variable will need to look like this:
+ 
+ {{{
+ export HADOOP_MASTER=devcluster01:/nutch/search
+ }}}
+ 
+ This will need to be copied to all of the slave nodes like this:
+ 
+ {{{
+ scp /nutch/search/conf/hadoop-env.sh [EMAIL 
PROTECTED]:/nutch/search/conf/hadoop-env.sh
+ }}}
+ 
+ And finally you will need to log into each of the slave nodes, create a 
default ssh key for each machine, and then copy it back to the master node, 
where you will append it to the /nutch/home/.ssh/authorized_keys file.  Here 
are the commands for each slave node; be sure to change the slavenodename when 
you copy the key file back to the master node so you don't overwrite files:
+ 
+ {{{
+ ssh -l nutch devcluster02
+ cd /nutch/home/.ssh
+ 
+ ssh-keygen -t rsa (Use empty responses for each prompt)
+   Enter passphrase (empty for no passphrase): 
+   Enter same passphrase again: 
+   Your identification has been saved in /nutch/home/.ssh/id_rsa.
+   Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
+   The key fingerprint is:
+   a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc [EMAIL PROTECTED]
+ 
+ scp id_rsa.pub [EMAIL PROTECTED]:/nutch/home/devcluster02.pub
+ }}}
+ 
+ Once you have done that for each of the slave nodes you can append the files 
to the authorized_keys file on the master node:
+ 
+ {{{
+ cd /nutch/home
+ cat devcluster*.pub >> .ssh/authorized_keys
+ }}}
+ 
+ With this setup, whenever you run the bin/start-all.sh script, files should 
be synced from the master node to each of the slave nodes.
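+ 
+ Before relying on the automatic sync, it is worth confirming that the 
reverse password-less login actually works.  A quick check from one of the 
slave nodes (devcluster01 and devcluster02 are the example hostnames from this 
tutorial):
+ 
+ {{{
+ ssh -l nutch devcluster02
+ ssh -o BatchMode=yes -l nutch devcluster01 echo ok
+ }}}
+ 
+ If this prints ok without prompting for a password, that slave will be able 
to rsync from the master.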
+ 
+ 
  == Conclusion ==
  
--------------------------------------------------------------------------------
  I know this has been a lengthy tutorial, but hopefully it has gotten you 
familiar with both Nutch and Hadoop.  Both are complicated applications, and 
setting them up, as you have learned, is not necessarily an easy task.  I hope 
this document has helped to make it easier for you.
