Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:

http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=39&rev2=40

  This document does not go into the Nutch or Hadoop architecture; resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop.
- '''N.B.''' Prerequisites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. It will also be of great benefit to have a look at the [[http://wiki.apache.org/hadoop/|Hadoop Wiki]]
+ '''1.''' Prerequisites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. It will also be of great benefit to have a look at the [[http://wiki.apache.org/hadoop/|Hadoop Wiki]].
+ 
+ '''2.''' In addition, it is far easier to get Nutch running if you already have an existing Hadoop cluster up and running, so it is strongly advised to complete the Hadoop cluster setup first and then return to this tutorial.
  
  <<TableOfContents(3)>>
  
  === Assumptions ===
@@ -352, +355 @@

  == Deploy Nutch to Multiple Machines ==
  --------------------------------------------------------------------------------
- '''The main point is to copy the nutch-* files (under $nutch_home/conf) and the crawl-urlfilter.txt file to the $hadoop_home/conf folder on all machines, including master and slaves, so that the Hadoop cluster can pick up that configuration on startup. Otherwise Nutch will complain with messages such as "0 records selected for fetching, exiting ... URLs to fetch - check your seed list and URL filters."'''
+ Along with the new Nutch architecture presented in version 1.3 onwards, we no longer need to copy any Nutch jar files and/or configuration to each node in the cluster.
+ The Nutch job jar you find in $NUTCH_HOME/runtime/deploy is self-contained and ships with all the configuration files necessary for Nutch to run on any vanilla Hadoop cluster. All you need is a healthy cluster and a Hadoop environment (cluster or local) that points to the jobtracker.
- 
- Once you have got the single node up and running, we can copy the configuration to the other slave nodes and set up those slave nodes to be started by our start script. First, if you still have the servers running on the local node, stop them with the stop-all script.
- 
- To copy the configuration to the other machines, run the following command. If you have followed the configuration up to this point, things should go smoothly:
- 
- {{{
- cd /nutch/search
- scp -r /nutch/search/* nutch@computer:/nutch/search
- }}}
- 
- Do this for every computer you want to use as a slave node. Then edit the slaves file, adding each slave node name to the file, one per line. You will also want to edit the hadoop-site.xml file and change the values for the map and reduce task numbers, making this a multiple of the number of machines you have. For our system, which has 6 data nodes, I put in 32 as the number of tasks. The replication property can also be changed at this time; a good starting value is something like 2 or 3. (See the note at the bottom about possibly having to clear the filesystem of new datanodes.) Once this is done you should be able to start up all of the nodes.
- 
- To start all of the nodes we use the exact same command as before:
- 
- {{{
- cd /nutch/search
- bin/start-all.sh
- }}}
- 
- '''A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting to call the start-all.sh script.'''
- 
- The first time all of the nodes are started, an ssh dialog may appear asking to add the hosts to the known_hosts file. You will have to type yes for each one and hit enter. The output may be a little weird the first time, but just keep typing yes and hitting enter if the dialogs keep appearing. You should see output showing all the servers starting on the local machine, and the job tracker and data node servers starting on the slave nodes. Once this is complete we are ready to begin our crawl.
  
  == Performing a Nutch Crawl ==
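+ As a rough sketch of what a deploy-mode crawl looks like with the self-contained job jar described above (the jar name, seed directory `urls`, and crawl parameters here are assumptions for a typical Nutch 1.3 build; adjust them to your own environment):
+ 
+ {{{
# Put the seed list into HDFS so every node in the cluster can read it
# (assumes the cluster is running and hadoop is on the PATH)
hadoop fs -put urls urls

# Submit the self-contained Nutch job jar to the jobtracker.
# The jar name depends on your build; for Nutch 1.3 it is typically
# apache-nutch-1.3.job under $NUTCH_HOME/runtime/deploy.
hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.3.job \
  org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 50
+ }}}
+ 
+ Because the job jar bundles the Nutch configuration, nothing needs to be installed on the slave nodes themselves; only the machine submitting the job needs the jar.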

