Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=36&rev2=37 {{{ cd /nutch/search mkdir urlsdir - vi urlsdir/urllist.txt + vi urls/seed.txt - http://lucene.apache.org + http://nutch.apache.org + http://apache.org }}} - You should now have a urls/urllist.txt file with the one line pointing to the apache lucene site. Now we are going to add that directory to the filesystem. Later the nutch crawl will use this file as a list of urls to crawl. To add the urls directory to the filesystem run the following command: + You should now have a urls/seed.txt file with two URLs (one per line) pointing to the Apache Nutch and Apache Software Foundation home sites respectively. Now we are going to add that directory to the filesystem. Later the nutch crawl will use this file as a list of urls to crawl. To add the urls directory to the filesystem run the following command: {{{ cd /nutch/search - bin/hadoop dfs -put urlsdir urlsdir + bin/hadoop dfs -put urls urls }}} You should see output stating that the directory was added to the filesystem. You can also confirm that the directory was added by using the ls command: @@ -345, +346 @@ Something interesting to note about the distributed filesystem is that it is user specific. If you store a directory urls under the filesystem with the nutch user, it is actually stored as /user/nutch/urls. What this means to us is that the user that does the crawl and stores it in the distributed filesystem must also be the user that starts the search, or no results will come back. You can try this yourself by logging in with a different user and runing the ls command as shown. It won't find the directories because is it looking under a different directory /user/username instead of /user/nutch. + At this stage it might be beneficial to try out a test crawl. + + From your hadoop home directory execute + + {{{ + hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5 + }}} + + As before, you can track progress through your logs, or alternatively navigate to the aforementioned Hadoop gui's. + - If everything worked then you are good to add other nodes and start the crawl. + If everything worked then you are good to add other nodes and start the crawl ;) == Deploy Nutch to Multiple Machines ==

== Deploy Nutch to Multiple Machines ==
