Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by ChiaHungLin. http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=28&rev2=29

== Deploy Nutch to Multiple Machines ==

'''The main point is to copy the nutch-* files (under $nutch_home/conf) and crawl-urlfilter.txt to the $hadoop_home/conf folder on all machines, master and slaves alike, so that the Hadoop cluster can pick up that configuration at startup. Otherwise Nutch will complain with messages such as "0 records selected for fetching, exiting .. URLs to fetch - check your seed list and URL filters."'''

Once you have the single node up and running, we can copy the configuration to the other slave nodes and set those slave nodes up to be started by our start script. First, if you still have the servers running on the local node, stop them with the stop-all script. To copy the configuration to the other machines, run the following command. If you have followed the configuration up to this point, things should go smoothly:

{{{
cd /nutch/search
scp -r /nutch/search/* nutch@computer:/nutch/search
}}}

Do this for every computer you want to use as a slave node. Then edit the slaves file, adding each slave node's name to the file, one per line. You will also want to edit the hadoop-site.xml file and change the values for the map and reduce task numbers, making this a multiple of the number of machines you have.
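As a sketch of the copy step, the per-slave scp can be wrapped in a small loop over the slaves file. The function name, the default paths, and the `nutch` user below are assumptions taken from this tutorial's layout, not fixed values:

```shell
# push_conf: copy the Nutch config files (nutch-* and crawl-urlfilter.txt)
# into Hadoop's conf directory on each host listed in the slaves file.
# Paths default to the /nutch/search layout used in this tutorial.
push_conf() {
  nutch_home=${1:-/nutch/search}
  hadoop_home=${2:-/nutch/search}
  while read -r host; do
    [ -n "$host" ] || continue    # skip blank lines in the slaves file
    scp "$nutch_home"/conf/nutch-* "$nutch_home"/conf/crawl-urlfilter.txt \
        "nutch@$host:$hadoop_home/conf/"
  done < "$hadoop_home/conf/slaves"
}
```

Run `push_conf` from the master after editing the slaves file; it pushes the same configuration to every slave in one pass, which avoids the silent "0 records selected for fetching" failure caused by a node that missed the copy.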
For our system, which has 6 data nodes, I put in 32 as the number of tasks. The replication property can also be changed at this time; a good starting value is 2 or 3. (See the note at the bottom about possibly having to clear the filesystem of new datanodes.) Once this is done, you should be able to start up all of the nodes.
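With the values mentioned above, the relevant hadoop-site.xml properties would look something like the following sketch; the numbers here are the ones used on this 6-node system and should be adjusted to your own cluster size:

```xml
<property>
  <name>mapred.map.tasks</name>
  <value>32</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>32</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Remember that this file, like the rest of the configuration, must be identical on the master and every slave before you run the start script.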

