Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=30&rev2=31 + = Nutch and Hadoop Tutorial = - = How to Setup Nutch (V1.1) and Hadoop = - -------------------------------------------------------------------------------- - Note: Originally this ([[NutchHadoopTutorial0.8]]) was written for version 0.8 of Nutch. This has been edited by people other than the original author so statements like "I did this" or "I recommend that" are slightly misleading. + As of the official Nutch 1.3 release the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes; namely '''local''' and '''deploy'''. By default, Nutch no longer comes with a Hadoop distribution, however when run in local mode e.g. running Nutch in a single process on one machine, then we use Hadoop as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run on in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce processing style. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both crawl and search across multiple machines. - -------------------------------------------------------------------------------- + This document does not go into the Nutch or Hadoop architecture, resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop. + '''N.B.''' Prerequsites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. - By default, out of the box, Nutch runs in a single process on one machine. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run on a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce processing style. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with Hadoop file system on multiple machines, including being able to both index (crawl) and search across multiple machines. - - This document does not go into the Nutch or Hadoop architecture. It only tells how to get the systems up and running. At the end of the tutorial though I will point you to relevant resources if you want to know more about the architecture of Nutch and Hadoop. - - The tutorial comes in two phases. Firstly we get Hadoop running on a single machine (a bit of a simple cluster!) and then more than one machine. === Assumptions ===

