Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:

http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=39&rev2=40

  This document does not go into the Nutch or Hadoop architecture; resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop.
- '''N.B.''' Prerequisites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. It will also be of great benefit to have a look at the [[http://wiki.apache.org/hadoop/|Hadoop Wiki]]
+ '''1.''' Prerequisites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. It will also be of great benefit to have a look at the [[http://wiki.apache.org/hadoop/|Hadoop Wiki]].
+ 
+ '''2.''' In addition, it is far easier to get Nutch running if you already have an existing Hadoop cluster up and running, so it is strongly advised to complete the Hadoop cluster setup first and then return to this tutorial.
  
  <<TableOfContents(3)>>
  
  === Assumptions ===
@@ -352, +355 @@

  == Deploy Nutch to Multiple Machines ==
  --------------------------------------------------------------------------------
- '''The main point is to copy the nutch-* files (under $nutch_home/conf) and the crawl-urlfilter.txt file to the $hadoop_home/conf folder on all machines, including master and slaves, so that the Hadoop cluster can pick up that configuration on startup. Otherwise Nutch will complain with messages such as "0 records selected for fetching, exiting ... URLs to fetch - check your seed list and URL filters."'''
+ Along with the new Nutch architecture presented in version 1.3 onwards, we no longer need to copy any Nutch jar files and/or configuration to each node in the cluster.
+ The Nutch job jar you find in $NUTCH_HOME/runtime/deploy is self-contained and ships with all the configuration files necessary for Nutch to run on any vanilla Hadoop cluster. All you need is a healthy cluster and a Hadoop environment (cluster or local) that points to the jobtracker.
- 
- Once you have got the single node up and running, we can copy the configuration to the other slave nodes and set up those slave nodes to be started by our start script. First, if you still have the servers running on the local node, stop them with the stop-all script.
- 
- To copy the configuration to the other machines, run the following command. If you have followed the configuration up to this point, things should go smoothly:
- 
- {{{
- cd /nutch/search
- scp -r /nutch/search/* nutch@computer:/nutch/search
- }}}
- 
- Do this for every computer you want to use as a slave node. Then edit the slaves file, adding each slave node name to the file, one per line. You will also want to edit the hadoop-site.xml file and change the values for the map and reduce task numbers, making this a multiple of the number of machines you have. For our system, which has 6 data nodes, I put in 32 as the number of tasks. The replication property can also be changed at this time; a good starting value is something like 2 or 3. (See the note at the bottom about possibly having to clear the filesystem of new datanodes.) Once this is done you should be able to start up all of the nodes.
- 
- To start all of the nodes we use the exact same command as before:
- 
- {{{
- cd /nutch/search
- bin/start-all.sh
- }}}
- 
- '''A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting to call the start-all.sh script.'''
- 
- The first time all of the nodes are started, an ssh dialog may appear asking to add the hosts to the known_hosts file. You will have to type yes for each one and hit enter. The output may be a little weird the first time, but just keep typing yes and hitting enter if the dialogs keep appearing. You should see output showing all the servers starting on the local machine, and the job tracker and data node servers starting on the slave nodes. Once this is complete we are ready to begin our crawl.
  
  == Performing a Nutch Crawl ==
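+ As a rough sketch of what a deploy-mode crawl looks like with the self-contained job jar described above (the jar name, seed directory `urls`, and crawl parameters here are assumptions for a typical Nutch 1.3 build; adjust them to your own environment):
+ 
+ {{{
# Put the seed list into HDFS so every node in the cluster can read it
# (assumes the cluster is running and hadoop is on the PATH)
hadoop fs -put urls urls

# Submit the self-contained Nutch job jar to the jobtracker.
# The jar name depends on your build; for Nutch 1.3 it is typically
# apache-nutch-1.3.job under $NUTCH_HOME/runtime/deploy.
hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.3.job \
  org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 50
+ }}}
+ 
+ Because the job jar bundles the Nutch configuration, nothing needs to be installed on the slave nodes themselves; only the machine submitting the job needs the jar.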

