Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=34&rev2=35

1) Perform this setup using root level access. This includes setting up the same user across multiple machines and setting up a local filesystem outside of the user's home directory. Root access is not required to set up Nutch and Hadoop (although it is sometimes convenient). If you do not have root access, you will need the same user set up across all of the machines you are using, and you will probably need to use a local filesystem inside your home directory.

- 2) All boxes will need an SSH server running (not just a client) as Hadoop uses SSH to start slave servers. Although we try to explain how to set up ssh so that communication between machines does not require a password you may need to learn how to do that elsewhere. Please see halfway down this tutorial[[http://help.github.com/linux-set-up-git/|here]]
+ 2) All boxes will need an SSH server running (not just a client), as Hadoop uses SSH to start slave servers. Although we try to explain how to set up ssh so that communication between machines does not require a password, you may need to learn how to do that elsewhere; see halfway down [[http://help.github.com/linux-set-up-git/|this tutorial]]. N.B. It is important that initially you only create keys on the master node; they are then copied over to your slave nodes.

3) For this tutorial we set up Nutch across a 6 node Hadoop cluster. If you are using a different number of machines you should still be fine, but you should have at least two different machines to prove the distributed capabilities of both HDFS and MapReduce.

@@ -44, +44 @@

Both Nutch and Hadoop are downloadable from their respective Apache websites.

- You can checkout the latest and greatest Nutch from source after downloading it from its SVN repository [[http://svn.apache.org/repos/asf/nutch/trunk/|here]].
Alternatively pick up a stable release from the Nutch site. The same should be done with Hadoop, and as mentioned earlier this, along with how to set up your 6 node cluster, is included in the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]].
+ You can checkout the latest and greatest Nutch from source after downloading it from its SVN repository [[http://svn.apache.org/repos/asf/nutch/trunk/|here]]. Alternatively pick up a stable release from the Nutch site. The same should be done with Hadoop, and as mentioned earlier, this, along with how to set up your 6 node cluster, is included in the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. This should be done on every node you wish to include within your cluster, i.e. both the Nutch and Hadoop packages should be installed on every machine.

- We are going to use ant to build it so if you have java and ant installed you should be fine. This tutorial is not going to go into how to install java or ant, if you want a complete reference for ant pick up Erik Hatcher's book [[http://www.manning.com/hatcher|Java Development with Ant]]
+ We are going to use ant to build it, so if you have java and ant installed you should be fine. This tutorial is not going to go into how to install java or ant; if you want a complete reference for ant, pick up Erik Hatcher's book [[http://www.manning.com/hatcher|Java Development with Ant]].
-
- == Building Nutch and Hadoop ==
- --------------------------------------------------------------------------------
- Once you have Nutch downloaded and unpacked, look inside it, where you should see the following folders and files:
-
- {{{
- + bin
- + conf
- + docs
- + lib
- + site
- + src
- build.properties (add this one)
- build.xml
- CHANGES.txt
- default.properties
- index.html
- LICENSE.txt
- README.txt
- }}}
-
- Add a build.properties file and inside of it add a variable called dist.dir with its value being the location where you want to build nutch.
So if you are building on a linux machine it would look something like this:
-
- {{{
- dist.dir=/path/to/build
- }}}
-
- This step is actually optional, as Nutch will create a build directory inside of the directory where you unzipped it by default, but I prefer building it to an external directory. You can name the build directory anything you want, but I recommend using a new empty folder to build into. Remember to create the build folder if it doesn't already exist.
-
- To build nutch, call the package ant task like this:
-
- {{{
- ant package
- }}}
-
- This should build nutch into your build folder. When it is finished you are ready to move on to deploying and configuring nutch.

- == Setting Up The Deployment Architecture ==
- --------------------------------------------------------------------------------
- Once we get nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and data nodes. This means that the script is going to start the hadoop daemons on the master node and then will ssh into all of the slave nodes and start daemons on the slave nodes.
+ Once we get Nutch deployed to all six machines we are going to call a script called start-all.sh that starts the services on the master node and the data nodes. This means that the script is going to start the hadoop daemons on the master node and then ssh into all of the slave nodes and start the daemons there.
- The start-all.sh script is going to expect that nutch is installed in exactly the same location on every machine. It is also going to expect that Hadoop is storing the data at the exact same filepath on every machine.
+ The start-all.sh script is going to expect that Nutch is installed in exactly the same location on every machine. It is also going to expect that Hadoop is storing the data at the exact same filepath on every machine.

- The way we did it was to create the following directory structure on every machine.
The search directory is where Nutch is installed. The filesystem is the root of the hadoop filesystem. The home directory is the nutch user's home directory. On our master node we also installed a tomcat 5.5 server for searching.
+ The start-all.sh script that starts the daemons on the master and slave nodes is going to need to be able to use a password-less login through ssh. Since the master node is going to start daemons on itself, we also need the ability to use a password-less login on itself; however, this should have already been done by now.
+ We need to set up the environment variables inside of the hadoop-env.sh file. Open the hadoop-env.sh file inside of vi:
- {{{
- /nutch
- /search
- (nutch installation goes here)
- /filesystem
- /local (used for local directory for searching)
- /home
- (nutch user's home directory)
- /tomcat (only on one server for searching)
- }}}
- I am not going to go into detail about how to install Tomcat, as again there are plenty of tutorials on how to do that. I will say that we removed all of the wars from the webapps directory and created a folder called ROOT under webapps, into which we unzipped the Nutch war file (nutch-0.8-dev.war). This makes it easy to edit configuration files inside of the Nutch war.
-
- So log into the master node and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands:
- {{{
+ cd HADOOP_HOME/conf
- ssh -l root devcluster01
-
- mkdir /nutch
- mkdir /nutch/search
- mkdir /nutch/filesystem
- mkdir /nutch/local
- mkdir /nutch/home
-
- groupadd users
- useradd -d /nutch/home -g users nutch
- chown -R nutch:users /nutch
- passwd nutch nutchuserpassword
- }}}
-
- Again, if you don't have root level access you will still need the same user on each machine, as the start-all.sh script expects it. It doesn't have to be a user named nutch, although that is what we use. Also, you could put the filesystem under the common user's home directory.
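As a minimal sketch of that non-root alternative, the same layout can be recreated under any prefix the common user can write to. The BASE path below is an illustrative assumption (a scratch directory for a dry run); on a real cluster you would use something like $HOME/nutch on every machine:

{{{
# Recreate the /nutch layout under a prefix we can write to without root.
# BASE is a stand-in; substitute your own (e.g. $HOME/nutch) on real nodes.
BASE=${BASE:-/tmp/nutch-layout}
mkdir -p "$BASE/search" "$BASE/filesystem" "$BASE/local" "$BASE/home"
ls "$BASE"
}}}

Whatever prefix you choose, it must be identical on every machine, since start-all.sh assumes the same paths everywhere.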
Basically, you don't have to be root, but it helps.
-
- The start-all.sh script that starts the daemons on the master and slave nodes is going to need to be able to use a password-less login through ssh. For this we are going to have to set up ssh keys on each of the nodes. Since the master node is going to start daemons on itself we also need the ability to use a password-less login on itself.
-
- You might have seen some old tutorials or information floating around the user lists that said you would need to edit the SSH daemon to allow the property PermitUserEnvironment and to set up local environment variables for the ssh logins through an environment file. This has changed. We no longer need to edit the ssh daemon and we can set up the environment variables inside of the hadoop-env.sh file. Open the hadoop-env.sh file inside of vi:
-
- {{{
- cd /nutch/search/conf
vi hadoop-env.sh
}}}

@@ -138, +66 @@

the hadoop-env.sh file:

{{{
- export HADOOP_HOME=/nutch/search
- export JAVA_HOME=/usr/java/jdk1.5.0_06
+ export HADOOP_HOME=/PATH/TO/HADOOP_HOME
+ export JAVA_HOME=/PATH/TO/JDK_HOME
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
}}}

+ Additionally at this stage, in accordance with the Hadoop tutorial, add the IP addresses of your master and slave nodes to HADOOP_HOME/conf/masters and HADOOP_HOME/conf/slaves respectively.
- There are other variables in this file which will affect the behavior of Hadoop. If when you start running the script later you start getting ssh errors, try changing the HADOOP_SSH_OPTS variable. Note also that, after the initial copy, you can set HADOOP_MASTER in your conf/hadoop-env.sh and it will use rsync to sync changes on the master to each slave node. There is a section below on how to do this.
+ There are numerous other variables (documented elsewhere) in the HADOOP_HOME/conf directory which will affect the behaviour of Hadoop.
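The masters and slaves files are plain text, one hostname (or IP address) per line. A sketch of populating them, writing into a scratch directory rather than a real HADOOP_HOME/conf, and with slave hostnames beyond devcluster02 that are assumptions for illustration:

{{{
# CONF stands in for HADOOP_HOME/conf; hostnames are illustrative.
CONF=${CONF:-/tmp/hadoop-conf}
mkdir -p "$CONF"
echo "devcluster01" > "$CONF/masters"
printf "devcluster02\ndevcluster03\ndevcluster04\n" > "$CONF/slaves"
cat "$CONF/slaves"
}}}

The slaves file is what start-all.sh reads to decide which machines to ssh into, so every node you list here must accept the password-less login set up below.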
If, when you start running the script later, you start getting ssh errors, try changing the HADOOP_SSH_OPTS variable.

- Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the nutch user we created earlier. Don't just su in as the nutch user, start up a new shell and login as the nutch user. If you su in the password-less login we are about to setup will not work in testing but will work when a new session is started as the nutch user.
+ Next we are going to create the keys on the master node and copy them over to each of the slave nodes. This must be done as the Nutch user we created earlier. Don't just su in as the Nutch user; start up a new shell and log in as the Nutch user. If you su in, the password-less login we are about to set up will not work in testing, but it will work when a new session is started as the Nutch user.

{{{
- cd /nutch/home
+ cd NUTCH_HOME
ssh-keygen -t rsa
(Use empty responses for each prompt)
Enter passphrase (empty for no passphrase):

@@ -163, +91 @@

On the master node you will copy the public key you just created to a file called authorized_keys in the same directory:

{{{
- cd /nutch/home/.ssh
+ cd NUTCH_HOME/.ssh
cp id_rsa.pub authorized_keys
}}}

You only have to run ssh-keygen on the master node. On each of the slave nodes, after the filesystem is created, you will just need to copy the keys over using scp. For example, to send the authorized_keys file to devcluster02 we might do this on devcluster01:

{{{
- scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys
+ scp NUTCH_HOME/.ssh/authorized_keys nutch@devcluster02:NUTCH_HOME/.ssh/authorized_keys
}}}

- You will have to enter the password for the nutch user the first time. An ssh prompt will appear the first time you login to each computer asking if you want to add the computer to the known hosts. Answer yes to the prompt.
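With several slave nodes the scp step simply repeats per host. A dry-run sketch (the hostnames beyond devcluster02 are assumptions, and the echo only prints each command; drop it to actually copy the file):

{{{
# Print one scp command per slave; remove 'echo' to run them for real.
SLAVES="devcluster02 devcluster03 devcluster04"
for host in $SLAVES; do
  echo scp ~/.ssh/authorized_keys nutch@$host:~/.ssh/authorized_keys
done
}}}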
Once the key is copied you shouldn't have to enter a password when logging in as the nutch user. Test it by logging into the slave nodes that you just copied the keys to:
+ You will have to enter the password for the Nutch user the first time. An ssh prompt will appear the first time you log in to each computer, asking if you want to add the computer to the known hosts. Answer yes to the prompt. Once the key is copied you shouldn't have to enter a password when logging in as the Nutch user. Test it by logging into the slave nodes that you just copied the keys to:

{{{
ssh devcluster02

@@ -181, +109 @@

hostname (should return the name of the slave node, here devcluster02)
}}}

- Once we have the ssh keys created we are ready to start deploying nutch to all of the slave nodes.
+ Once we have the ssh keys created we are ready to start deploying Nutch to all of the slave nodes.
- (Note: this is a rather simple example of how to set up ssh without requiring a passphrase. There are other documents available which can help you with this if you have problems. It is important to test that the nutch user can ssh to all of the machines in your cluster so don't skip this stage)
+ (Note: this is a rather simple example of how to set up ssh without requiring a passphrase. There are other documents available which can help you with this if you have problems. It is important to test that the Nutch user can ssh to all of the machines in your cluster, so don't skip this stage.)

- == Deploy Nutch to Single Machine ==
--------------------------------------------------------------------------------

