Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=31&rev2=32

= Nutch and Hadoop Tutorial =
As of the official Nutch 1.3 release, the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes, namely '''local''' and '''deploy'''. By default, Nutch no longer comes with a Hadoop distribution; when run in local mode, i.e. running Nutch in a single process on one machine, Hadoop is used simply as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce-style processing. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both crawl and search across multiple machines. N.B. This tutorial is designed and maintained to work with Nutch trunk.

This document does not go into the Nutch or Hadoop architecture; resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells you how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop.
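To make the two modes concrete, here is a minimal sketch, assuming a Nutch 1.3+ source checkout built with ant; the urls seed directory and the crawl/crawldb path are placeholders:

{{{
# Building from source produces both runtimes
ant runtime

# Local mode: runs in a single JVM on this machine, with Hadoop
# present only as a library dependency
cd runtime/local
bin/nutch inject crawl/crawldb urls

# Deploy mode: the same command is submitted as a MapReduce job
# to the Hadoop cluster (the hadoop command must be on your PATH)
cd ../deploy
bin/nutch inject crawl/crawldb urls
}}}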
Some things are assumed for this tutorial:

1) Perform this setup using root level access. This includes setting up the same user across multiple machines and setting up a local filesystem outside of the user's home directory. Root access is not required to set up Nutch and Hadoop (although it is sometimes convenient). If you do not have root access, you will need the same user set up across all of the machines you are using, and you will probably need to use a local filesystem inside your home directory.

2) All boxes will need an SSH server running (not just a client), as Hadoop uses SSH to start the slave servers. We give a minimal sketch of passwordless ssh in the Network Setup section below, but you may need to learn more elsewhere; see, for example, halfway down [[http://help.github.com/linux-set-up-git/|this tutorial]].

3) For this tutorial we set up Nutch across a six-node Hadoop cluster. If you are using a different number of machines you should still be fine, but you should have at least two machines to prove the distributed capabilities of both HDFS and MapReduce.

4) If something doesn't work for you, first try searching, then send a message to the Nutch or Hadoop users mailing list. Good questions, as well as suggestions or tips, are welcome. Why not add them to the end of this Wiki page?

5) A real no-brainer... we assume that you are a Java programmer familiar with JAVA_HOME, the ant build tool, subversion, IDEs and such like.

Ok, let's have some fun!

== Network Setup ==
--------------------------------------------------------------------------------
It is important to know that you don't have to have big hardware to get up and running with Nutch and Hadoop. The architecture was designed to make the most of commodity hardware. For the purpose of this tutorial, the nodes in the six-node cluster are named as follows:

{{{
devcluster01
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06
}}}
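Before going any further, here is the passwordless ssh setup promised in assumption 2. This is only a sketch: the nutch user name is a placeholder for whatever common user you created, and you should repeat the copy step for every node in the cluster:

{{{
# On the master node, generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Copy the public key to each slave (repeat for devcluster03 ... devcluster06)
ssh-copy-id nutch@devcluster02

# Verify that no password prompt appears
ssh nutch@devcluster02 hostname
}}}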
To begin, our master node is devcluster01. By master node I mean that it runs the Hadoop services that coordinate with the slave nodes (all of the other computers), and it is the machine on which we perform our crawl.
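As a rough illustration of that coordination, the following sketch assumes a Hadoop 0.20.x/1.x-style distribution, in which conf/slaves lists the slave hostnames one per line and bin/start-all.sh starts the daemons across the cluster from the master:

{{{
# On the master (devcluster01), tell Hadoop which machines are slaves;
# Hadoop sshes into each of these to start the DataNode and
# TaskTracker daemons
cat > conf/slaves <<EOF
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06
EOF

# With passwordless ssh in place (see above), the whole cluster can
# then be started from the master alone
bin/start-all.sh
}}}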
== Downloading Nutch and Hadoop ==
--------------------------------------------------------------------------------
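Since this tutorial is maintained against Nutch trunk, the usual way to obtain the source is a subversion checkout followed by an ant build. This is a sketch only; the repository URL reflects the Apache SVN layout at the time of writing, so verify it against the Nutch website:

{{{
# Check out the latest Nutch source from Apache subversion
svn co http://svn.apache.org/repos/asf/nutch/trunk/ nutch
cd nutch

# Build the local and deploy runtimes (requires ant and a JDK,
# with JAVA_HOME set appropriately)
ant runtime
}}}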
