[Nutch Wiki] Update of "NutchTutorial" by RichardBraman

Apache Wiki Tue, 07 Mar 2006 14:03:56 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by RichardBraman:
http://wiki.apache.org/nutch/NutchTutorial

------------------------------------------------------------------------------
     * a ''crawl_parse'' contains the outlink urls, used to update the crawldb
   1. The indexes are Lucene-format indexes.
  
- === Step-by-Step: Boostrapping the Web Database ===
+ === Step-by-Step: Seeding the CrawlDB with a list of URLs ===
  
+ Option 1: Bootstrapping from the DMOZ database
  The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open 
Directory. First we must download and uncompress the file listing all of the 
DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)
  
  {{{ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
@@ -93, +94 @@

  {{{ bin/nutch inject crawl/crawldb dmoz }}}
  
  Now we have a web database with around 1000 as-yet unfetched URLs in it.
+ 
+ Option 2: Instead of bootstrapping from DMOZ, we can create a text file called 
urls, with one URL per line, and initialize the crawldb with the selected 
URLs.
+ 
+ {{{ bin/nutch inject crawl/crawldb urls }}}
+ 
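A minimal sketch of Option 2, assuming the commands are run from the top of a Nutch installation. The two seed URLs are only illustrative placeholders, and the guard around bin/nutch is an addition so that the snippet does nothing harmful when run elsewhere.

```shell
# Write the seed list: plain text, one URL per line.
# These two URLs are placeholders; substitute the sites you want to crawl.
cat > urls <<'EOF'
http://lucene.apache.org/nutch/
http://wiki.apache.org/nutch/
EOF

# Sanity check: count the non-empty lines that will be used as seeds.
seed_count=$(grep -c . urls)
echo "seed URLs: $seed_count"

# Inject the seeds into the crawldb (same command as in the tutorial text).
# The -x test is an assumption-free guard: it only skips the step when
# bin/nutch is not present in the current directory.
if [ -x bin/nutch ]; then
  bin/nutch inject crawl/crawldb urls
fi
```

Keeping the seed list in a small hand-written file makes it easy to start with a handful of sites before moving on to a larger list such as the DMOZ dump.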
  
  === Step-by-Step: Fetching ===
