Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchHadoopTutorial" page has been changed by TejasPatil: https://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=40&rev2=41 Comment: correction in crawl command (by Wahaj All over @user) to read: +^http://([a-z0-9]*\.)*apache.org/ }}} - We have already added our urls to the distributed filesystem and we have edited our urlfilter so now it is time to begin the crawl. To start the nutch crawl firstly copy your nutch-${version}.job jar over to $HADOOP_HOME, then use the following command: + We have already added our urls to the distributed filesystem and we have edited our urlfilter so now it is time to begin the crawl. To start the nutch crawl firstly copy your apache-nutch-${version}.job jar over to $HADOOP_HOME, then use the following command: {{{ cd $HADOOP_HOME - hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls -depth 3 -topN 5 + hadoop jar apache-nutch-${version}.job org.apache.nutch.crawl.Crawl urls -dir crawlDir -depth 3 -topN 5 }}} We are using the nutch crawl command. The urls dir is the urls directory that we added to the distributed filesystem. The "-dir crawl" is the output directory. This will also go to the distributed filesystem. The depth is 3 meaning it will only get 3 page links deep. There are other options you can specify, see the command documentation for those options.

