Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=31&rev2=32

= Nutch and Hadoop Tutorial =
As of the official Nutch 1.3 release, the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes, namely '''local''' and '''deploy'''. By default, Nutch no longer comes with a Hadoop distribution; when run in local mode, i.e. running Nutch in a single process on one machine, Hadoop is used simply as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce-style processing. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both crawl and search across multiple machines. N.B. This tutorial is designed and maintained to work with Nutch trunk.

This document does not go into the Nutch or Hadoop architecture; resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells you how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop.
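To make the two modes concrete, here is a minimal sketch, assuming a Nutch 1.3+ source checkout built with ant; the urls seed directory and the crawl/crawldb path are placeholders:

{{{
# Building from source produces both runtimes
ant runtime

# Local mode: runs in a single JVM on this machine, with Hadoop
# present only as a library dependency
cd runtime/local
bin/nutch inject crawl/crawldb urls

# Deploy mode: the same command is submitted as a MapReduce job
# to the Hadoop cluster (the hadoop command must be on your PATH)
cd ../deploy
bin/nutch inject crawl/crawldb urls
}}}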
Some things are assumed for this tutorial:

1) Perform this setup using root level access. This includes setting up the same user across multiple machines and setting up a local filesystem outside of the user's home directory. Root access is not required to set up Nutch and Hadoop (although it is sometimes convenient). If you do not have root access, you will need the same user set up across all of the machines you are using, and you will probably need to use a local filesystem inside your home directory.

2) All boxes will need an SSH server running (not just a client), as Hadoop uses SSH to start the slave servers. We give a minimal sketch of passwordless ssh in the Network Setup section below, but you may need to learn more elsewhere; see, for example, halfway down [[http://help.github.com/linux-set-up-git/|this tutorial]].

3) For this tutorial we set up Nutch across a six-node Hadoop cluster. If you are using a different number of machines you should still be fine, but you should have at least two machines to prove the distributed capabilities of both HDFS and MapReduce.

4) If something doesn't work for you, first try searching, then send a message to the Nutch or Hadoop users mailing list. Good questions, as well as suggestions or tips, are welcome. Why not add them to the end of this Wiki page?

5) A real no-brainer... we assume that you are a Java programmer familiar with JAVA_HOME, the ant build tool, subversion, IDEs and such like.

Ok, let's have some fun!

== Network Setup ==
--------------------------------------------------------------------------------
It is important to know that you don't have to have big hardware to get up and running with Nutch and Hadoop. The architecture was designed to make the most of commodity hardware. For the purpose of this tutorial, the nodes in the six-node cluster are named as follows:

{{{
devcluster01
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06
}}}
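Before going any further, here is the passwordless ssh setup promised in assumption 2. This is only a sketch: the nutch user name is a placeholder for whatever common user you created, and you should repeat the copy step for every node in the cluster:

{{{
# On the master node, generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Copy the public key to each slave (repeat for devcluster03 ... devcluster06)
ssh-copy-id nutch@devcluster02

# Verify that no password prompt appears
ssh nutch@devcluster02 hostname
}}}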
To begin, our master node is devcluster01. By master node I mean that it runs the Hadoop services that coordinate with the slave nodes (all of the other computers), and it is the machine on which we perform our crawl.
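As a rough illustration of that coordination, the following sketch assumes a Hadoop 0.20.x/1.x-style distribution, in which conf/slaves lists the slave hostnames one per line and bin/start-all.sh starts the daemons across the cluster from the master:

{{{
# On the master (devcluster01), tell Hadoop which machines are slaves;
# Hadoop sshes into each of these to start the DataNode and
# TaskTracker daemons
cat > conf/slaves <<EOF
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06
EOF

# With passwordless ssh in place (see above), the whole cluster can
# then be started from the master alone
bin/start-all.sh
}}}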
== Downloading Nutch and Hadoop ==
--------------------------------------------------------------------------------
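Since this tutorial is maintained against Nutch trunk, the usual way to obtain the source is a subversion checkout followed by an ant build. This is a sketch only; the repository URL reflects the Apache SVN layout at the time of writing, so verify it against the Nutch website:

{{{
# Check out the latest Nutch source from Apache subversion
svn co http://svn.apache.org/repos/asf/nutch/trunk/ nutch
cd nutch

# Build the local and deploy runtimes (requires ant and a JDK,
# with JAVA_HOME set appropriately)
ant runtime
}}}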
