[Nutch Wiki] Trivial Update of "NutchHadoopTutorial" by LewisJohnMcgibbney

Apache Wiki Tue, 15 Nov 2011 14:33:15 -0800

The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=32&rev2=33

  
  OK, let's have some fun!
  
+ == Hadoop Cluster Setup ==
- == Network Setup ==
- 
--------------------------------------------------------------------------------
  
  You don't need powerful hardware to get up and running with Nutch and Hadoop. 
The architecture was designed to make the most of commodity hardware. For the 
purpose of this tutorial the nodes in the 6-node cluster are named as follows:
  
@@ -40, +39 @@

  
  To begin, our master node is devcluster01. By master node I mean that it will 
run the Hadoop services that coordinate with the slave nodes (all of the other 
computers), and it is the machine on which we performed our crawl.
  
+ == Downloading Hadoop and Nutch ==
- == Downloading Nutch and Hadoop ==
- 
--------------------------------------------------------------------------------
- Both Nutch and Hadoop are downloadable from the Apache website.  The 
necessary Hadoop files are bundled with Nutch so unless you are going to be 
developing Hadoop you only need to download Nutch.
  
+ Both Nutch and Hadoop are downloadable from their respective Apache websites.
- We built Nutch from source after downloading it from its subversion 
repository.
- Nightly builds of Nutch can be found here:
  
- http://hudson.zones.apache.org/hudson/job/Nutch-trunk/
+ You can check out the latest Nutch source from its SVN repository 
[[http://svn.apache.org/repos/asf/nutch/trunk/|here]]. Alternatively, pick up a 
stable release from the Nutch site. The same should be done with Hadoop; as 
mentioned earlier, this, along with how to set up your 6-node cluster, is 
covered in the [[http://hadoop.apache.org/common/docs/stable/|Hadoop 
Tutorial]].
  
+ We are going to use Ant to build it, so if you have Java and Ant installed you 
should be fine. This tutorial is not going to go into how to install Java or 
Ant. If you want a complete reference for Ant, pick up Erik Hatcher's book 
[[http://www.manning.com/hatcher|Java Development with Ant]].
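
Since the build assumes Java and Ant are already installed (and SVN if you check out from source), a quick pre-flight check can save a failed build. The following is only a minimal sketch and is not part of Nutch or Hadoop: the helper name missing_tools and the tool list are assumptions made for illustration, and it only verifies that the executables are on your PATH, not which versions they are.

```python
import shutil

# Assumed tool list for illustration: "svn" is only needed for a source checkout.
REQUIRED_TOOLS = ("java", "ant", "svn")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("java, ant and svn were all found on PATH.")
```

If nothing is reported missing, you should be able to check out the trunk URL given above with your SVN client and run Ant from the checkout directory.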
- At time of writing this version (Jun 2010) Nutch includes Hadoop Jars version 
0.20.2
- 
- You can get a packaged tarball or extract from subversion. Knowing how to use 
tar or subversion is outside of the scope of this tutorial. Once you have a 
subversion client you can either browse the Nutch subversion webpage at:
- 
- http://nutch.apache.org/version_control.html
- 
- Or you can access the Nutch subversion repository through the client at:
- 
- http://svn.apache.org/repos/asf/nutch/ (previously at 
http://svn.apache.org/repos/asf/lucene/nutch/ when Nutch was a part of Lucene)
- 
- We are going to use ant to build it so if you have java and ant installed you 
should be fine.
- 
- I am not going to go into how to install java or ant, if you are working with 
this level of software you should know how to do that and there are plenty of 
tutorial on building software with ant.  If you want a complete reference for 
ant pick up Erik Hatcher's book "''Java Development with Ant''":
- 
- http://www.manning.com/hatcher
- 
- It is worth noting that previous versions of Nutch came already built. But 
nowadays the release is just source code and so does have to be built before 
use.
  
  == Building Nutch and Hadoop ==
  
--------------------------------------------------------------------------------
