Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=36&rev2=37 {{{ cd /nutch/search mkdir urlsdir - vi urlsdir/urllist.txt + vi urls/seed.txt - http://lucene.apache.org + http://nutch.apache.org + http://apache.org }}} - You should now have a urls/urllist.txt file with the one line pointing to the apache lucene site. Now we are going to add that directory to the filesystem. Later the nutch crawl will use this file as a list of urls to crawl. To add the urls directory to the filesystem run the following command: + You should now have a urls/seed.txt file with two URLs (one per line) pointing to the Apache Nutch and Apache Software Foundation home sites respectively. Now we are going to add that directory to the filesystem. Later the nutch crawl will use this file as a list of urls to crawl. To add the urls directory to the filesystem run the following command: {{{ cd /nutch/search - bin/hadoop dfs -put urlsdir urlsdir + bin/hadoop dfs -put urls urls }}} You should see output stating that the directory was added to the filesystem. You can also confirm that the directory was added by using the ls command: @@ -345, +346 @@ Something interesting to note about the distributed filesystem is that it is user specific. If you store a directory urls under the filesystem with the nutch user, it is actually stored as /user/nutch/urls. What this means to us is that the user that does the crawl and stores it in the distributed filesystem must also be the user that starts the search, or no results will come back. You can try this yourself by logging in with a different user and runing the ls command as shown. It won't find the directories because is it looking under a different directory /user/username instead of /user/nutch. + At this stage it might be beneficial to try out a test crawl. + + From your hadoop home directory execute + + {{{ + hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5 + }}} + + As before, you can track progress through your logs, or alternatively navigate to the aforementioned Hadoop gui's. + - If everything worked then you are good to add other nodes and start the crawl. + If everything worked then you are good to add other nodes and start the crawl ;) == Deploy Nutch to Multiple Machines ==

== Deploy Nutch to Multiple Machines ==
