Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchHadoopTutorial" page has been changed by ChiaHungLin.
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=28&rev2=29

--------------------------------------------------

  
  == Deploy Nutch to Multiple Machines ==
  
--------------------------------------------------------------------------------
+ 
+ '''The main point is to copy the nutch-* files (under $nutch_home/conf) and 
crawl-urlfilter.txt to the $hadoop_home/conf folder on all machines, master 
and slaves alike, so that the Hadoop cluster picks up that configuration at 
startup. Otherwise Nutch will complain with messages such as "0 records 
selected for fetching, exiting .. URLs to fetch - check your seed list and URL 
filters."'''
+ 
+ 
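The copy step described in the note above can be sketched as follows. This demo uses temporary stand-in directories and empty placeholder files so it can run anywhere; substitute your real $nutch_home and $hadoop_home paths.

```shell
# Demo of the copy step using temporary stand-in directories; the real
# directories are your Nutch and Hadoop install paths (e.g. /nutch/search).
NUTCH_HOME=$(mktemp -d)
HADOOP_HOME=$(mktemp -d)
mkdir -p "$NUTCH_HOME/conf" "$HADOOP_HOME/conf"

# stand-ins for the real configuration files
touch "$NUTCH_HOME/conf/nutch-site.xml" "$NUTCH_HOME/conf/nutch-default.xml" \
      "$NUTCH_HOME/conf/crawl-urlfilter.txt"

# the actual copy: nutch-* plus crawl-urlfilter.txt into Hadoop's conf dir
cp "$NUTCH_HOME"/conf/nutch-* "$NUTCH_HOME/conf/crawl-urlfilter.txt" \
   "$HADOOP_HOME/conf/"
ls "$HADOOP_HOME/conf"
```

Remember that this has to happen on every machine in the cluster, not just the master.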
  Once you have the single node up and running, we can copy the 
configuration to the other slave nodes and set up those slave nodes to be 
started by our start script.  First, if you still have the servers running on 
the local node, stop them with the stop-all script.
  
  To copy the configuration to the other machines, run the following command.  
If you have followed the configuration up to this point, things should go 
smoothly:
@@ -457, +461 @@

  cd /nutch/search
  scp -r /nutch/search/* nutch@computer:/nutch/search
  }}}
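Since the scp step has to be repeated for every slave, a small loop over the host names can save some typing. This is a sketch with hypothetical host names (computer1, computer2); the echo is a dry run standing in for the real scp command, so you can check the commands before running them.

```shell
# Hypothetical slave host names; replace with your own, or read them
# from the slaves file.  The echo is a dry run standing in for scp:
# drop the echo (and the quotes around the command) to actually copy.
for host in computer1 computer2; do
  echo "scp -r /nutch/search/* nutch@$host:/nutch/search"
done
```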
- 
- '''The main point is to copy nutch-* (under $nutch_home/conf) and 
crawl-urlfilter.txt files to $hadoop_home/conf (all machines, including master 
and slaves) folder so that hadoop cluster can pick up those configuration when 
startup. Otherwise hadoop cluster will complain with messages e.g. "0 records 
selected for fetching, exiting .. URLs to fetch - check your seed list and URL 
filters."'''
  
  Do this for every computer you want to use as a slave node.  Then edit the 
slaves file, adding each slave node's name to the file, one per line.  You will 
also want to edit the hadoop-site.xml file and change the values for the map 
and reduce task numbers, making them a multiple of the number of machines you 
have.  For our system, which has 6 data nodes, I put in 32 as the number of 
tasks.  The replication property can also be changed at this time; a good 
starting value is 2 or 3. *(See the note at the bottom about possibly having 
to clear the filesystem of new datanodes.)  Once this is done you should be 
able to start up all of the nodes.
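As a concrete illustration, the hadoop-site.xml changes above might look like the fragment below. The property names (mapred.map.tasks, mapred.reduce.tasks, dfs.replication) are the ones used by Hadoop releases of this era; the values are the examples from the text (32 tasks for a 6-node cluster, replication of 2), so tune them to your own hardware.

```xml
<!-- Illustrative values for a 6-datanode cluster; adjust to your setup. -->
<property>
  <name>mapred.map.tasks</name>
  <value>32</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>32</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```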
  
