[Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by LewisJohnMcgibbney

Apache Wiki Fri, 02 Sep 2011 12:46:33 -0700

The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=72&rev2=73

  The parser also takes a few minutes, as it must parse the full file. Finally, 
we initialize the crawl db with the selected urls.
  
  {{{ 
- bin/nutch inject crawl/crawldb dmoz 
+ bin/nutch inject crawldb dmoz 
  }}}
  
  Now we have a web database with around 1000 as-yet unfetched URLs in it.
@@ -137, +137 @@

  ===== Option 2.  Bootstrapping from an initial seed list. =====
  This option shadows the creation of the seed list as covered [[#3. Crawl your first website|here]].
  
+ {{{ 
- {{{ bin/nutch inject crawldb urls }}}
+ bin/nutch inject crawldb urls 
+ }}}
  
  ==== Step-by-Step: Fetching ====
  To fetch, we first generate a fetch list from the database:
  
+ {{{ 
- {{{ bin/nutch generate crawldb segments }}}
+ bin/nutch generate crawldb segments 
+ }}}
  
  This generates a fetch list for all of the pages due to be fetched. The fetch 
list is placed in a newly created segment directory. The segment directory is 
named by the time it's created. We save the name of this segment in the shell 
variable {{{s1}}}:
  
@@ -152, +156 @@

  }}}
  Now we run the fetcher on this segment with:
  
+ {{{ 
- {{{ bin/nutch fetch $s1 }}}
+ bin/nutch fetch $s1 
+ }}}
  
  When this is complete, we update the database with the results of the fetch:
  
+ {{{ 
- {{{ bin/nutch updatedb crawldb $s1 }}}
+ bin/nutch updatedb crawldb $s1 
+ }}}
  
  Now the database contains both updated entries for all initial pages as well 
as new entries that correspond to newly discovered pages linked from the 
initial set.
  
  Then we parse the entries:
  
+ {{{ 
- {{{ bin/nutch parse $1 }}}
+ bin/nutch parse $1 
+ }}}
  
  Now we generate and fetch a new segment containing the top-scoring 1000 pages:
  
@@ -191, +201 @@

  ==== Step-by-Step: Invertlinks ====
  Before indexing we first invert all of the links, so that we may index 
incoming anchor text with the pages.
  
+ {{{ 
- {{{ bin/nutch invertlinks linkdb -dir segments }}}
+ bin/nutch invertlinks linkdb -dir segments 
+ }}}
  
  We are now ready to search with Apache Solr.
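
The commands touched by this revision can be strung together as one pass. The following is only a sketch, not the page's own script: `crawl_once` and the `NUTCH` override are hypothetical names introduced here, the steps run in the order the page lists them, and the parse step uses `$s1` on the assumption that the `$1` shown in the diff is meant to be the segment variable. It also assumes that segment directories are named by timestamp, so the last entry in sorted order is the newest one.

```shell
#!/bin/sh
# Sketch of one crawl pass, run from the Nutch install directory.
# NUTCH is a hypothetical override (not a Nutch feature); setting
# NUTCH=echo prints the commands instead of running them.
NUTCH="${NUTCH:-bin/nutch}"

crawl_once() {
    # Seed the crawl db from the urls directory (Option 2).
    $NUTCH inject crawldb urls || return 1
    # Generate a fetch list in a newly created segment directory.
    $NUTCH generate crawldb segments || return 1
    # Assumption: timestamp-named segments sort chronologically,
    # so the last entry is the segment that was just generated.
    s1=$(ls -d segments/2* | tail -1)
    $NUTCH fetch "$s1" || return 1
    $NUTCH updatedb crawldb "$s1" || return 1
    # Assumption: the page's "$1" is a typo for the segment variable.
    $NUTCH parse "$s1" || return 1
    # Invert links so incoming anchor text can be indexed with the pages.
    $NUTCH invertlinks linkdb -dir segments || return 1
}
```

Calling `crawl_once` with `NUTCH=echo` set gives a dry run that shows which segment would be fetched before anything touches the crawl db.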
