[Nutch Wiki] Update of "NutchTutorial" by RichardBraman

Apache Wiki Tue, 07 Mar 2006 14:07:33 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by RichardBraman:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
     * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
   1. The indexes are Lucene-format indexes.
  
- === Step-by-Step: Seeding the CrawlDB with a list of URLS ===
+ === Step-by-Step: Seeding the Crawl DB with a list of URLS ===
  
  Option 1:  Bootstrapping the DMOZ database
  The injector adds URLs to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.)
@@ -146, +146 @@

  Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
  
  {{{ bin/nutch invertlinks crawl/linkdb crawl/segments }}}
- 
+ NOTE: the invertlinks command only applies to Nutch 0.8 and higher.
  To index the segments we use the index command, as follows:
  
  {{{ bin/nutch index indexes crawl/linkdb crawl/segments/* }}}
