[Nutch Wiki] Update of "NutchTutorial" by SebastianNagel

Apache Wiki Sun, 10 Jun 2012 12:31:30 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NutchTutorial" page has been changed by SebastianNagel:
http://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=57&rev2=58

Comment:
Changes for 1.5; harmonized paths for step-by-step crawling to 
crawl/{crawldb,segments,linkdb}: these should be the same as for the crawl 
command (both share the same solrindex section).

  
  == Steps ==
  == 1 Setup Nutch from binary distribution ==
-  * Unzip your binary Nutch package to `$HOME/nutch-1.X`
+  * Unzip your binary Nutch package to `$HOME/nutch-1.X/`
-  * `cd $HOME/nutch-1.X/runtime/local`
+  * `cd $HOME/nutch-1.X/`
  
  From now on, we are going to use `${NUTCH_RUNTIME_HOME}` to refer to the 
current directory.
  
@@ -48, +48 @@

  }}}
   * `mkdir -p urls`
   * `cd urls`
-  * `touch seed.txt` to create a text file `seed.txt` under `/urls` with the 
following content (one URL per line for each site you want Nutch to crawl).
+  * `touch seed.txt` to create a text file `seed.txt` under `urls/` with the 
following content (one URL per line for each site you want Nutch to crawl).
  
  {{{
  http://nutch.apache.org/
  }}}
- * Edit the file `conf/regex-urlfilter.txt` and replace
+  * Edit the file `conf/regex-urlfilter.txt` and replace
  
  {{{
  # accept anything else
@@ -129, +129 @@

  The parser also takes a few minutes, as it must parse the full file. Finally, 
we initialize the crawldb with the selected URLs.
  
  {{{
- bin/nutch inject crawldb dmoz
+ bin/nutch inject crawl/crawldb dmoz
  }}}
  Now we have a Web database with around 1,000 as-yet unfetched URLs in it.
  
@@ -137, +137 @@

  This option shadows the creation of the seed list as covered 
[[#A3._Crawl_your_first_website|here]].
  
  {{{
- bin/nutch inject crawldb urls
+ bin/nutch inject crawl/crawldb urls
  }}}
  ==== Step-by-Step: Fetching ====
  To fetch, we first generate a fetch list from the database:
  
  {{{
- bin/nutch generate crawldb crawldb/segments
+ bin/nutch generate crawl/crawldb crawl/segments
  }}}
  This generates a fetch list for all of the pages due to be fetched. The fetch 
list is placed in a newly created segment directory. The segment directory is 
named by the time it's created. We save the name of this segment in the shell 
variable {{{s1}}}:
  
  {{{
- s1=`ls -d crawldb/segments/2* | tail -1`
+ s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
  }}}
  Now we run the fetcher on this segment with:
@@ -166, +166 @@

  When this is complete, we update the database with the results of the fetch:
  
  {{{
- bin/nutch updatedb crawldb $s1
+ bin/nutch updatedb crawl/crawldb $s1
  }}}
  Now the database contains both updated entries for all initial pages as well 
as new entries that correspond to newly discovered pages linked from the 
initial set.
  
  Now we generate and fetch a new segment containing the top-scoring 1,000 
pages:
  
  {{{
- bin/nutch generate crawldb crawldb/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
- s2=`ls -d segments/2* | tail -1`
+ s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  
  bin/nutch fetch $s2
  bin/nutch parse $s2
- bin/nutch updatedb crawldb $s2
+ bin/nutch updatedb crawl/crawldb $s2
  }}}
  Let's fetch one more round:
  
  {{{
- bin/nutch generate crawldb crawldb/segments -topN 1000
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
- s3=`ls -d segments/2* | tail -1`
+ s3=`ls -d crawl/segments/2* | tail -1`
  echo $s3
  
  bin/nutch fetch $s3
  bin/nutch parse $s3
- bin/nutch updatedb crawldb $s3
+ bin/nutch updatedb crawl/crawldb $s3
  }}}
  By this point we've fetched a few thousand pages. Let's index them!
  
@@ -198, +198 @@

  Before indexing we first invert all of the links, so that we may index 
incoming anchor text with the pages.
  
  {{{
- bin/nutch invertlinks crawldb/linkdb -dir crawldb/segments
+ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  }}}
  We are now ready to search with Apache Solr.
  
@@ -223, +223 @@

   * run the Solr Index command:
  
  {{{
- bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb 
crawldb/linkdb crawldb/segments/*
+ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/*
  }}}
  
  The call signature for running the solrindex has changed. The linkdb is now 
optional, so you need to denote it with a "-linkdb" flag on the command line.
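
The step-by-step rounds above repeat the same generate, fetch, parse, and updatedb commands against the `crawl/` layout. As a rough sketch only, one round could be wrapped in a shell function. The names `latest_segment` and `crawl_round` and the `NUTCH` variable are hypothetical helpers for illustration, not part of Nutch; the commands themselves are the ones shown in the tutorial, and the sketch assumes `-topN` is passed on every round.

```shell
#!/bin/sh
# Hypothetical helper variable: path to the nutch launcher script.
NUTCH=${NUTCH:-bin/nutch}

# Print the newest segment directory. Segment names are timestamps,
# so the last entry in sorted order is the most recently created one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Run one generate/fetch/parse/updatedb round.
#   $1 = crawl directory holding crawldb/ and segments/ (e.g. crawl)
#   $2 = value for -topN (e.g. 1000)
crawl_round() {
    crawl_dir=$1
    top_n=$2
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$top_n" || return 1
    segment=$(latest_segment "$crawl_dir/segments")
    # Stop if generate did not leave a segment directory behind.
    [ -n "$segment" ] || return 1
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment"
}
```

With the functions loaded in the current shell, `crawl_round crawl 1000` would stand in for one of the `s2`/`s3` blocks above. Stopping at the first failing command avoids updating the crawldb from a segment that was not fully fetched and parsed.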
