Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=30&rev2=31 + = Nutch and Hadoop Tutorial = - = How to Setup Nutch (V1.1) and Hadoop = - -------------------------------------------------------------------------------- - Note: Originally this ([[NutchHadoopTutorial0.8]]) was written for version 0.8 of Nutch. This has been edited by people other than the original author so statements like "I did this" or "I recommend that" are slightly misleading. + As of the official Nutch 1.3 release the source code architecture has been greatly simplified to allow us to run Nutch in one of two modes; namely '''local''' and '''deploy'''. By default, Nutch no longer comes with a Hadoop distribution, however when run in local mode e.g. running Nutch in a single process on one machine, then we use Hadoop as a dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run on in deploy mode, within a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce processing style. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with the Hadoop file system on multiple machines, including being able to both crawl and search across multiple machines. - -------------------------------------------------------------------------------- + This document does not go into the Nutch or Hadoop architecture, resources relating to these topics can be found [[FrontPage#Nutch Development|here]]. It only tells how to get the systems up and running. There are also relevant resources at the end of this tutorial if you want to know more about the architecture of Nutch and Hadoop. + '''N.B.''' Prerequsites for this tutorial are both the [[NutchTutorial|Nutch Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]]. - By default, out of the box, Nutch runs in a single process on one machine. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch because of its capability to run on a Hadoop cluster. This gives you the benefit of a distributed file system (HDFS) and MapReduce processing style. The purpose of this tutorial is to provide a step-by-step method to get Nutch running with Hadoop file system on multiple machines, including being able to both index (crawl) and search across multiple machines. - - This document does not go into the Nutch or Hadoop architecture. It only tells how to get the systems up and running. At the end of the tutorial though I will point you to relevant resources if you want to know more about the architecture of Nutch and Hadoop. - - The tutorial comes in two phases. Firstly we get Hadoop running on a single machine (a bit of a simple cluster!) and then more than one machine. === Assumptions ===

