Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by TejasPatil: https://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=40&rev2=41 Comment: correction in crawl command (by Wahaj All over @user) to read: +^http://([a-z0-9]*\.)*apache.org/ }}} - We have already added our urls to the distributed filesystem and we have edited our urlfilter so now it is time to begin the crawl. To start the nutch crawl firstly copy your nutch-${version}.job jar over to $HADOOP_HOME, then use the following command: + We have already added our urls to the distributed filesystem and we have edited our urlfilter so now it is time to begin the crawl. To start the nutch crawl firstly copy your apache-nutch-${version}.job jar over to $HADOOP_HOME, then use the following command: {{{ cd $HADOOP_HOME - hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls -depth 3 -topN 5 + hadoop jar apache-nutch-${version}.job org.apache.nutch.crawl.Crawl urls -dir crawlDir -depth 3 -topN 5 }}} We are using the nutch crawl command. The urls dir is the urls directory that we added to the distributed filesystem. The "-dir crawl" is the output directory. This will also go to the distributed filesystem. The depth is 3 meaning it will only get 3 page links deep. There are other options you can specify, see the command documentation for those options.

