Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchTutorial" page has been changed by SebastianNagel:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=57&rev2=58

Comment:
changes for 1.5; harmonized paths for step-by-step crawling to crawl/{crawldb,segments,linkdb}: should be the same as for the crawl command (both share the same solrindex section)

  == Steps ==
  == 1 Setup Nutch from binary distribution ==
- * Unzip your binary Nutch package to `$HOME/nutch-1.X`
+ * Unzip your binary Nutch package to `$HOME/nutch-1.X/`
- * `cd $HOME/nutch-1.X/runtime/local`
+ * `cd $HOME/nutch-1.X/`
  From now on, we are going to use `${NUTCH_RUNTIME_HOME}` to refer to the current directory.
@@ -48, +48 @@
  }}}
  * `mkdir -p urls`
  * `cd urls`
- * `touch seed.txt` to create a text file `seed.txt` under `/urls` with the following content (one URL per line for each site you want Nutch to crawl).
+ * `touch seed.txt` to create a text file `seed.txt` under `urls/` with the following content (one URL per line for each site you want Nutch to crawl).
  {{{
  http://nutch.apache.org/
  }}}
- * Edit the file `conf/regex-urlfilter.txt` and replace
+ * Edit the file `conf/regex-urlfilter.txt` and replace
  {{{
  # accept anything else
@@ -129, +129 @@
  The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawldb with the selected URLs.
  {{{
- bin/nutch inject crawldb dmoz
+ bin/nutch inject crawl/crawldb dmoz
  }}}
  Now we have a Web database with around 1,000 as-yet unfetched URLs in it.
@@ -137, +137 @@
  This option shadows the creation of the seed list as covered [[#A3._Crawl_your_first_website|here]].
  {{{
- bin/nutch inject crawldb urls
+ bin/nutch inject crawl/crawldb urls
  }}}
  ==== Step-by-Step: Fetching ====
  To fetch, we first generate a fetch list from the database:
  {{{
- bin/nutch generate crawldb crawldb/segments
+ bin/nutch generate crawl/crawldb crawl/segments
  }}}
  This generates a fetch list for all of the pages due to be fetched. The fetch list is placed in a newly created segment directory.
  The segment directory is named by the time it's created. We save the name of this segment in the shell variable {{{s1}}}:
  {{{
- s1=`ls -d crawldb/segments/2* | tail -1`
+ s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
  }}}
  Now we run the fetcher on this segment with:
@@ -166, +166 @@
  When this is complete, we update the database with the results of the fetch:
  {{{
- bin/nutch updatedb crawldb $s1
+ bin/nutch updatedb crawl/crawldb $s1
  }}}
  Now the database contains both updated entries for all initial pages as well as new entries that correspond to newly discovered pages linked from the initial set.
  Now we generate and fetch a new segment containing the top-scoring 1,000 pages:
  {{{
- bin/nutch generate crawldb crawldb/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
- s2=`ls -d segments/2* | tail -1`
+ s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
  bin/nutch parse $s2
- bin/nutch updatedb crawldb $s2
+ bin/nutch updatedb crawl/crawldb $s2
  }}}
  Let's fetch one more round:
  {{{
- bin/nutch generate crawldb crawldb/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
- s3=`ls -d segments/2* | tail -1`
+ s3=`ls -d crawl/segments/2* | tail -1`
  echo $s3
  bin/nutch fetch $s3
  bin/nutch parse $s3
- bin/nutch updatedb crawldb $s3
+ bin/nutch updatedb crawl/crawldb $s3
  }}}
  By this point we've fetched a few thousand pages. Let's index them!
@@ -198, +198 @@
  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
  {{{
- bin/nutch invertlinks crawldb/linkdb -dir crawldb/segments
+ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  }}}
  We are now ready to search with Apache Solr.
@@ -223, +223 @@
  * run the Solr Index command:
  {{{
- bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
+ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
  }}}
  The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.
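
The generate, fetch, parse and updatedb commands in the diff above are repeated once per round with only the segment name changing. As a rough sketch (not part of the tutorial or of Nutch itself), one round could be wrapped in a shell function using the harmonized crawl/ paths. The names NUTCH, CRAWL_DIR, latest_segment and crawl_round are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Sketch only: wraps the per-round commands from the tutorial in one function.
# NUTCH, CRAWL_DIR, latest_segment and crawl_round are hypothetical names,
# not part of Nutch.

NUTCH="${NUTCH:-bin/nutch}"      # set NUTCH=echo to print the commands instead of running them
CRAWL_DIR="${CRAWL_DIR:-crawl}"  # expected to hold crawldb/, segments/ and linkdb/

# Print the newest segment directory. Segment names are timestamps,
# so the last entry in sorted order is the most recent one.
latest_segment() {
    ls -d "$CRAWL_DIR"/segments/2* | tail -1
}

# Run one round: generate a fetch list, then fetch, parse and update the crawldb.
# $1 is the value passed to -topN.
crawl_round() {
    $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$1" || return 1
    segment=$(latest_segment)
    # Stop if no segment directory was found.
    [ -n "$segment" ] || return 1
    $NUTCH fetch "$segment" &&
    $NUTCH parse "$segment" &&
    $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment"
}
```

After sourcing the file from the Nutch directory, `crawl_round 1000` would correspond to one of the rounds shown in the diff. Running it with `NUTCH=echo` prints the commands without executing them, which may help to check the paths first.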

