Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=34&rev2=35

  
  1) Perform this setup using root-level access.  This includes setting up the same user across multiple machines and setting up a local filesystem outside of the user's home directory.  Root access is not required to set up Nutch and Hadoop (although it is sometimes convenient).  If you do not have root access, you will need the same user set up across all of the machines you are using, and you will probably need to use a local filesystem inside your home directory.
  
- 2) All boxes will need an SSH server running (not just a client) as Hadoop 
uses SSH to start slave servers. Although we try to explain how to set up ssh 
so that communication between machines does not require a password you may need 
to learn how to do that elsewhere. Please see halfway down this 
tutorial[[http://help.github.com/linux-set-up-git/|here]]
+ 2) All boxes will need an SSH server running (not just a client), as Hadoop uses SSH to start the slave servers. Although we try to explain how to set up SSH so that communication between machines does not require a password, you may need to learn how to do that elsewhere; see, for example, halfway down the tutorial [[http://help.github.com/linux-set-up-git/|here]]. N.B. It is important that initially you create keys only on the master node; they are then copied over to your slave nodes. (A quick way to check that an SSH server is running is sketched after this list.)
  
  3) For this tutorial we set up Nutch across a 6-node Hadoop cluster. If you are using a different number of machines you should still be fine, but you should have at least two different machines to prove the distributed capabilities of both HDFS and MapReduce.
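A minimal check, run on each box, that an SSH server (not just a client) is actually available; the exact commands are only a sketch and may vary by distribution:

{{{
# Should log you in (or prompt for a password), proving sshd is reachable locally
ssh localhost 'hostname'

# Alternatively, look for a running sshd process
ps -e | grep sshd
}}}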
  
@@ -44, +44 @@

  
  Both Nutch and Hadoop are downloadable from their respective Apache websites.
  
- You can checkout the latest and greatest Nutch from source after downloading 
it from its SVN repository 
[[http://svn.apache.org/repos/asf/nutch/trunk/|here]]. Alternatively pick up a 
stable release from the Nutch site. The same should be done with Hadoop, and as 
mentioned eariler this, along with how to set up your 6 node cluster is 
included in the [[http://hadoop.apache.org/common/docs/stable/|Hadoop 
Tutorial]].
+ You can check out the latest and greatest Nutch source from its SVN repository [[http://svn.apache.org/repos/asf/nutch/trunk/|here]]. Alternatively, pick up a stable release from the Nutch site. The same should be done with Hadoop, and as mentioned earlier, this, along with how to set up your 6-node cluster, is covered in the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. This should be done on every node you wish to include within your cluster, i.e. both the Nutch and Hadoop packages should be installed on every machine.
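If you go the SVN route, the checkout looks something like this (a sketch; the trunk URL is the one linked above):

{{{
# Check out Nutch trunk into a local directory called "nutch"
svn checkout http://svn.apache.org/repos/asf/nutch/trunk/ nutch
cd nutch
}}}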
  
- We are going to use ant to build it so if you have java and ant installed you 
should be fine. This tutorial is not going to go into how to install java or 
ant, if you want a complete reference for ant pick up Erik Hatcher's book 
[[http://www.manning.com/hatcher|Java Development with Ant]]
+ We are going to use Ant to build it, so if you have Java and Ant installed you should be fine. This tutorial is not going to go into how to install Java or Ant; if you want a complete reference for Ant, pick up Erik Hatcher's book [[http://www.manning.com/hatcher|Java Development with Ant]].
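A quick sanity check that both prerequisites are on your PATH (any reasonably recent JDK and Ant release should do):

{{{
# Both commands should print a version string rather than "command not found"
java -version
ant -version
}}}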
- 
- == Building Nutch and Hadoop ==
- 
--------------------------------------------------------------------------------
- Once you have Nutch downloaded and unpacked look inside it where you should 
see the following folders and files:
- 
- {{{
- + bin
- + conf
- + docs
- + lib
- + site
- + src
-     build.properties (add this one)
-     build.xml
-     CHANGES.txt
-     default.properties
-     index.html
-     LICENSE.txt
-     README.txt
- }}}
- 
- Add a build.properties file and inside of it add a variable called dist.dir 
with its value being the location where you want to build nutch. So if you are 
building on a linux machine it would look something  like this:
- 
- {{{
- dist.dir=/path/to/build
- }}}
- 
- This step is actually optional as Nutch will create a build directory inside 
of the directory where you unzipped it by default, but I prefer building it to 
an external directory.  You can name the build directory anything you want but 
I recommend using a new empty folder to build into.  Remember to create the 
build folder if it doesn't already exist.
- 
- To build nutch call the package ant task like this:
- 
- {{{
- ant package
- }}}
- 
- This should build nutch into your build folder.  When it is finished you are 
ready to move on to deploying and configuring nutch.
- 
  
  == Setting Up The Deployment Architecture ==
- 
--------------------------------------------------------------------------------
- Once we get nutch deployed to all six machines we are going to call a script 
called start-all.sh that starts the services on the master node and data nodes. 
 This means that the script is going to start the hadoop daemons on the master 
node and then will ssh into all of the slave nodes and start daemons on the 
slave nodes.
+ Once we get Nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and data nodes. This means that the script is going to start the Hadoop daemons on the master node and then ssh into all of the slave nodes and start daemons on the slave nodes.
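For orientation, this is roughly what the eventual call looks like (a sketch; HADOOP_HOME stands for wherever Hadoop is installed on the master node):

{{{
# Run on the master node only; it ssh-es into each host listed in conf/slaves
cd HADOOP_HOME
bin/start-all.sh

# The matching shutdown script
bin/stop-all.sh
}}}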
  
- The start-all.sh script is going to expect that nutch is installed in exactly 
the same location on every machine.  It is also going to expect that Hadoop is 
storing the data at the exact same filepath on every machine.
+ The start-all.sh script is going to expect that Nutch is installed in exactly 
the same location on every machine.  It is also going to expect that Hadoop is 
storing the data at the exact same filepath on every machine.
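For example, a layout along the following lines (entirely hypothetical paths; choose whatever suits you, as long as it is identical on every node) satisfies that expectation:

{{{
/home/nutch/search      # Nutch/Hadoop installation, i.e. HADOOP_HOME
/home/nutch/filesystem  # Hadoop data directory
}}}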
  
- The way we did it was to create the following directory structure on  every 
machine.  The search directory is where Nutch is installed.  The filesystem is 
the root of the hadoop filesystem.  The home directory is the nutch users's 
home directory.  On our master node we also  installed a tomcat 5.5 server for 
searching.
+ The start-all.sh script that starts the daemons on the master and slave nodes is going to need to be able to use a password-less login through ssh. Since the master node is going to start daemons on itself, we also need the ability to use a password-less login on itself; however, this should already have been done by now.
  
+ We need to set up the environment variables inside the hadoop-env.sh file.  Open the hadoop-env.sh file in vi:
- {{{
- /nutch
-   /search
-     (nutch installation goes here)
-   /filesystem
-   /local (used for local directory for searching)
-   /home
-     (nutch user's home directory)
-   /tomcat    (only on one server for searching)
- }}}
  
- I am not going to go into detail about how to install Tomcat as again there 
are plenty of tutorials on how to do that.  I will say that we removed all of 
the wars from the webapps directory and created a  folder called ROOT under 
webapps into which we unzipped the Nutch war file (nutch-0.8-dev.war).  This 
makes it easy to edit configuration files inside of the Nutch war
- 
- So log into the master nodes and all of the slave nodes as root. Create the 
nutch user and the different filesystems with the following commands:
- 
  {{{
+ cd HADOOP_HOME/conf
- ssh -l root devcluster01
- 
- mkdir /nutch
- mkdir /nutch/search
- mkdir /nutch/filesystem
- mkdir /nutch/local
- mkdir /nutch/home
- 
- groupadd users
- useradd -d /nutch/home -g users nutch
- chown -R nutch:users /nutch
- passwd nutch nutchuserpassword
- }}}
- 
- Again if you don't have root level access you will still need the same user 
on each machine as the start-all.sh script expects it.  It doesn't have to be a 
user named nutch user although that is what we use.  Also you could put the 
filesystem under the common user's home directory.  Basically, you don't have 
to be root, but it helps.
- 
- The start-all.sh script that starts the daemons on the master and slave nodes 
is going to need to be able to use a password-less login through ssh.  For this 
we are going to have to setup ssh keys on each of the nodes.  Since the master 
node is going to start daemons on itself we also need the ability to user a 
password-less login on itself. 
- 
- You might have seen some old tutorials or information floating around the 
user lists that said you would need to edit the SSH daemon to allow the 
property PermitUserEnvironment and to setup local environment variables for the 
ssh logins through an environment file.  This has changed.  We no longer need 
to edit the ssh daemon and we can setup the environment variables inside of the 
hadoop-env.sh file.  Open the hadoop-env.sh file inside of vi:
- 
- {{{
- cd /nutch/search/conf
  vi hadoop-env.sh
  }}}
  
@@ -138, +66 @@

  the hadoop-env.sh file:
  
  {{{
- export HADOOP_HOME=/nutch/search
- export JAVA_HOME=/usr/java/jdk1.5.0_06
+ export HADOOP_HOME=/PATH/TO/HADOOP_HOME
+ export JAVA_HOME=/PATH/TO/JDK_HOME
  export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
  }}}
+ Additionally at this stage, in accordance with the Hadoop tutorial, add the IP addresses (or host names) of your master and slave nodes to HADOOP_HOME/conf/masters and HADOOP_HOME/conf/slaves respectively.
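The format of both files is simply one host name (or IP address) per line. A sketch, using the hypothetical devcluster host names that appear later in this tutorial (the lines beginning with # are just labels for the sketch):

{{{
# HADOOP_HOME/conf/masters
devcluster01

# HADOOP_HOME/conf/slaves
devcluster02
devcluster03
devcluster04
}}}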
  
- There are other variables in this file which will affect the behavior of 
Hadoop.  If when you start running the script later you start getting ssh 
errors, try changing the HADOOP_SSH_OPTS variable.  Note also that, after the 
initial copy, you can set HADOOP_MASTER in your conf/hadoop-env.sh and it will 
use rsync changes on the master to each slave node.  There is a section below 
on how to do this.
+ There are numerous other variables (documented elsewhere) in the HADOOP_HOME/conf directory which will affect the behaviour of Hadoop. If you start getting ssh errors when you run the scripts later, try changing the HADOOP_SSH_OPTS variable.
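For example (an illustrative setting only, using standard OpenSSH options; adjust to your own environment), in hadoop-env.sh:

{{{
# Hypothetical tweak: shorter connect timeout, explicit default port
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -p 22"
}}}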
  
- Next we are going to create the keys on the master node and copy them over to 
each of the slave nodes.  This must be done as the nutch user we created 
earlier.  Don't just su in as the nutch user, start up a new shell and  login 
as the nutch user.  If you su in the password-less login we are about to setup 
will not work in testing but will work when a new session is started as the 
nutch user. 
+ Next we are going to create the keys on the master node and copy them over to each of the slave nodes.  This must be done as the Nutch user we created earlier.  Don't just sudo in as the Nutch user; start up a new shell and log in as the Nutch user.  If you sudo in, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the Nutch user.
  
  {{{
- cd /nutch/home
+ cd NUTCH_HOME
  
  ssh-keygen -t rsa (Use empty responses for each prompt)
    Enter passphrase (empty for no passphrase): 
@@ -163, +91 @@

  On the master node you will copy the public key you just created to a file 
called authorized_keys in the same directory:
  
  {{{
- cd /nutch/home/.ssh
+ cd NUTCH_HOME/.ssh
  cp id_rsa.pub authorized_keys
  }}}
  
  You only have to run ssh-keygen on the master node.  On each of the slave nodes, after the filesystem is created, you will just need to copy the keys over using scp, e.g. to send the authorized_keys file to devcluster02 we might do this on devcluster01:
  
  {{{
- scp /nutch/home/.ssh/authorized_keys 
nutch@devcluster02:/nutch/home/.ssh/authorized_keys
+ scp NUTCH_HOME/.ssh/authorized_keys 
nutch@devcluster02:NUTCH_HOME/.ssh/authorized_keys
  }}}
  
- You will have to enter the password for the nutch user the first time. An ssh 
prompt will appear the first time you login to each computer  asking if you 
want to add the computer to the known hosts.  Answer yes to  the prompt.  Once 
the key is copied you shouldn't have to enter a password  when logging in as 
the nutch user.  Test it by logging into the slave nodes that you just copied 
the keys to:
+ You will have to enter the password for the Nutch user the first time. An ssh prompt will appear the first time you log in to each computer, asking if you want to add the computer to the known hosts.  Answer yes to the prompt.  Once the key is copied you shouldn't have to enter a password when logging in as the Nutch user.  Test it by logging into the slave nodes that you just copied the keys to:
  
  {{{
  ssh devcluster02
@@ -181, +109 @@

  hostname (should return the name of the slave node, here devcluster02)
  }}}
  
- Once we have the ssh keys created we are ready to start deploying nutch to 
all of the slave nodes.
+ Once we have the ssh keys created we are ready to start deploying Nutch to 
all of the slave nodes.
  
- (Note: this is a rather simple example of how to set up ssh without requiring 
a passphrase. There are other documents available which can help you with this 
if you have problems. It is important to test that the nutch user can ssh to 
all of the machines in your cluster so don't skip this stage)
+ (Note: this is a rather simple example of how to set up ssh without requiring a passphrase. There are other documents available which can help you with this if you have problems. It is important to test that the Nutch user can ssh to all of the machines in your cluster, so don't skip this stage.)
- 
  
  == Deploy Nutch to Single Machine ==
  
--------------------------------------------------------------------------------
